Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression

Authors: Kunjun Li, Zigeng Chen, Cheng-Yen Yang, Jenq-Neng Hwang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluation demonstrates the effectiveness of our method. As shown in Figure 1, compared to the original Infinity-8B model, ScaleKV achieves negligible quality degradation (GenEval score remains at 0.79 and DPG score decreases slightly from 86.61 to 86.49) while requiring merely 10% of the original GPU memory consumption. These results validate that ScaleKV effectively addresses the fundamental memory bottlenecks that have constrained the practical deployment of VAR models. Section headings: 4 Experiments; 4.1 Experimental Setup; 4.2 Main Results; 4.3 Analytical Experiments.
Researcher Affiliation | Academia | University of Washington; National University of Singapore; EMAIL, EMAIL
Pseudocode | No | The paper describes the methodology using prose and mathematical equations, such as the one for the Attention Selectivity Index (ASI) in Section 3.3, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/StargazerX0/ScaleKV.
Open Datasets | Yes | We assessed output consistency with the original models using the MS-COCO 2017 [32] validation set, which comprises 5,000 images and captions. We use GPT-4o [21] to generate 10 prompts for calibrating drafters and refiners. To evaluate the stability of our drafter/refiner identification method across varying calibration set sizes, we conducted an ablation study using prompts sampled from the LAION-Art dataset.
Dataset Splits | Yes | We assessed output consistency with the original models using the MS-COCO 2017 [32] validation set, which comprises 5,000 images and captions.
Hardware Specification | Yes | Our method achieves up to 1.25× speedup on a single NVIDIA H20 GPU, with performance gains becoming more pronounced as resolution increases.
Software Dependencies | No | The paper mentions the use of 'GPT-4o [21]' for generating prompts but does not specify version numbers for any other software dependencies, libraries, or frameworks used in their implementation.
Experiment Setup | Yes | We evaluated ScaleKV on two VAR-based text-to-image models of different capacities: Infinity-2B and Infinity-8B [16], to validate our method's generalizability across model scales. We analyzed performance under three memory budget constraints: 4%, 10%, and 20% of the original KV cache size. For memory efficiency, we report the KV cache memory usage measured with a batch size of 8. We use GPT-4o [21] to generate 10 prompts for calibrating drafters and refiners. With an initial refiner budget of 600 tokens, we observe a consistent improvement in FID from 3.49 to 2.53 as decay rate increases from 0 to 70, confirming our observation that refiner attention becomes increasingly focused at higher scales, requiring fewer resources.