TurboQuant

TurboQuant compresses LLM key-value (KV) caches to 3 bits with no accuracy loss, delivers up to 8x throughput on H100 GPUs, and requires no training.