Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression

Authors: Kunjun Li, Zigeng Chen, Cheng-Yen Yang, Jenq-Neng Hwang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive evaluation demonstrates the effectiveness of our method. As shown in Figure 1, compared to the original Infinity-8B model, ScaleKV achieves negligible quality degradation (GenEval score remains at 0.79 and DPG score decreases slightly from 86.61 to 86.49) while requiring merely 10% of the original GPU memory consumption. These results validate that ScaleKV effectively addresses the fundamental memory bottlenecks that have constrained the practical deployment of VAR models. 4 Experiments; 4.1 Experimental Setup; 4.2 Main Results; 4.3 Analytical Experiments
Researcher Affiliation	Academia	University of Washington; National University of Singapore; EMAIL, EMAIL
Pseudocode	No	The paper describes the methodology using prose and mathematical equations such as for the Attention Selectivity Index (ASI) in Section 3.3, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Code is available at https://github.com/StargazerX0/ScaleKV.
Open Datasets	Yes	We assessed output consistency with the original models using the MS-COCO 2017 [32] validation set, which comprises 5,000 images and captions. We use GPT-4o [21] to generate 10 prompts for calibrating drafters and refiners. To evaluate the stability of our drafter/refiner identification method across varying calibration set sizes, we conducted an ablation study using prompts sampled from the LAION-Art dataset.
Dataset Splits	Yes	We assessed output consistency with the original models using the MS-COCO 2017 [32] validation set, which comprises 5,000 images and captions.
Hardware Specification	Yes	Our method achieves up to 1.25× speedup on a single NVIDIA H20 GPU, with performance gains becoming more pronounced as resolution increases.
Software Dependencies	No	The paper mentions the use of 'GPT-4o [21]' for generating prompts but does not specify version numbers for any other software dependencies, libraries, or frameworks used in their implementation.
Experiment Setup	Yes	We evaluated ScaleKV on two VAR-based text-to-image models of different capacities: Infinity-2B and Infinity-8B [16], to validate our method’s generalizability across model scales. We analyzed performance under three memory budget constraints: 4%, 10%, and 20% of the original KV cache size. For memory efficiency, we report the KV cache memory usage measured with a batch size of 8. We use GPT-4o [21] to generate 10 prompts for calibrating drafters and refiners. With an initial refiner budget of 600 tokens, we observe a consistent improvement in FID from 3.49 to 2.53 as decay rate increases from 0 to 70, confirming our observation that refiner attention becomes increasingly focused at higher scales, requiring fewer resources.