OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

1Stability AI

OCTOPUS rotates keys, groups them into triplets, and encodes each direction with an octahedral map — then decodes directly in attention. Codec geometry matters most as bits fall.

Abstract

The key-value (KV) cache dominates memory bandwidth and footprint in long-context autoregressive inference. Recent rotation-preconditioned codecs (TurboQuant, PolarQuant) show that a structured random rotation followed by a per-coordinate scalar quantizer matched to an analytically tractable marginal is a near-optimal recipe for KV compression. OCTOPUS advances this paradigm through joint quantization of rotated coordinate triplets. Each triplet's direction is mapped to a square via an octahedral parameterization, and the two resulting coordinates and the triplet norm are Lloyd-Max quantized against implementation-matched marginals. Optimizing the per-triplet squared error gives a strictly non-uniform bit allocation depending only on the total dimensionality of the keys. We find the finite-dimensional quality optimum with sweeps to be constant on every real decoder we test. The codec is data-oblivious, online, and deterministic given a seed. Across text, video, and audio, OCTOPUS matches or beats every prior rotation codec at every reported bit width and metric, with a lead that grows as bits drop for extreme compression. Furthermore, a fused Triton implementation reconstructs keys on the fly without materializing the uncompressed key, so the codec adds no decode-time bandwidth or latency over the existing dequantization.

Interactive Demo

Reduce the KV bit budget with the slider and pick a quantization method. Compare against the uncompressed baseline on video (CausVid, Causal Forcing) and audio (AAR + CLAP conditioning on 10s AudioSet clips).

OCTOPUS @ 4 bits

Baseline (bf16 KV)

Results

We evaluate on language (Qwen2.5-7B-Instruct-1M: WikiText-2 PPL and multi-key needle-in-a-haystack recall up to 128K), video (Wan-1.3B DiT with CausVid and Causal Forcing), and audio (AAR). At 2 bits, scalar rotation baselines often collapse while OCTOPUS remains usable.

Language modeling

Video generation

Side-by-side at matched K=V bit width. TurboQuant-QJL often collapses at 2b (dark or noisy output).

Audio generation

BibTeX

@article{boss2026octopus,
  title={OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization},
  author={Boss, Mark and Voleti, Vikram and Donn{\'e}, Simon and Vainer, Shimon},
  journal={arXiv preprint},
  year={2026}
}