Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

Vision-centric Token Compression in Large Language Model

Authors: Ling Xing, Alex Jinpeng Wang, Rui Yan, Xiangbo Shu, Jinhui Tang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On eleven in-context learning benchmarks, VIST achieves the same accuracy with 2.3× fewer tokens, cutting FLOPs by 16% and memory by 50%. This method delivers remarkable results, outperforming the strongest text encoder-based compression method CEPE by 7.6% on average over benchmarks like TriviaQA, NQ, PopQA, NLUI, and CLIN, setting a new standard for token efficiency in LLMs. To validate the effectiveness of VIST, we primarily compare with the text-encoder-based token compression counterpart CEPE [21]. Section 4 is titled "Experiment" and includes subsections such as "Experimental Setup", "Long-context Language Modeling", "In-context Learning", "Open-domain Question Answering", and "Ablation Study", all of which report performance metrics and comparisons in tables and figures.
Researcher Affiliation | Academia | Ling Xing¹, Alex Jinpeng Wang², Rui Yan¹, Xiangbo Shu¹, Jinhui Tang³; ¹Nanjing University of Science and Technology, ²Central South University, ³Nanjing Forestry University
Pseudocode | No | The paper describes the methodology in narrative text and figures (e.g., Figure 2 for the overall pipeline) but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps in a code-like format.
Open Source Code | Yes | The project is at https://github.com/CSU-JPG/VIST.
Open Datasets | Yes | Our pretraining dataset is an official sample of the RedPajama dataset [58], including 1B tokens from seven domains: ArXiv, Book, C4, Commoncrawl, GitHub, StackExchange, and Wikipedia. We evaluate on ArXiv and Book from RedPajama [58] test split, alongside long-context datasets: PG19 [59], Proof [60], and Code [61]. We evaluate VIST on In-Context Learning (ICL) tasks across 11 widely-used text-classification datasets: SST2 [62], MR [63], AGN [64], SST5 [62], TREC, TREF [65], DBP [64], NLUS, NLUI [66], BANK [67], and CLIN [68]. Experiments are conducted on three open-domain QA datasets, including TriviaQA [69], NQ [70], and PopQA [71]. The data used in this paper is publicly available.
Dataset Splits | Yes | We evaluate on ArXiv and Book from RedPajama [58] test split... The evaluation metric is perplexity (PPL) over the last 256 tokens of each input. Following [21], we randomly sample 250 text examples per dataset. The ICL results in Table 2 are reported as the average accuracy over three random seeds. Appendix B.1: For Long-context Language Modeling, we sample 5000 sequences for each dataset.
Hardware Specification | Yes | We conduct the experiments on NVIDIA V100 GPUs. VIST significantly increases the in-context text length from 16k to 64k at inference stage on a single 24GB RTX 4090 GPU, compared to CEPE.
Software Dependencies | No | We validate VIST with TinyLlama [52]. The frozen vision encoder in our model is ViT-L/14 [11] from OpenCLIP. To reduce computational overhead, our model employs float16 precision and DeepSpeed Zero-2 with CPU off-loading [53]. The text encoder in CEPE follows the configuration of RoBERTa-large [76], as in CEPE [21]. Table 8: Learning rate scheduler Cosine decay [74], Optimizer AdamW [75].
Experiment Setup | Yes | We provide the data and optimization hyperparameters during the pre-training of VIST in Table 8. Table 8 details: Image size (height, width) (224, 224), Image mode RGB, Font Google Noto Sans, Font size 10, Peak learning rate 3.0e-4, Warmup ratio 4%, Learning rate scheduler Cosine decay [74], Optimizer AdamW [75] with β1 0.9, β2 0.999, ϵ 10^-8, Mixed precision training fp16, Number of steps 2000 steps. The τ for PVE is 0.07. The weight of PVE is 1.
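
The learning-rate settings quoted from Table 8 (peak learning rate 3.0e-4, warmup ratio 4%, cosine decay, 2000 steps) can be read as a schedule over training steps. The sketch below is only an illustration and is not taken from the authors' code: the linear shape of the warmup, the decay down to zero, and the names `lr_at`, `PEAK_LR`, `WARMUP_RATIO`, `TOTAL_STEPS`, and `WARMUP_STEPS` are all assumptions made here.

```python
import math

# Values quoted from Table 8 of the paper.
PEAK_LR = 3.0e-4
WARMUP_RATIO = 0.04
TOTAL_STEPS = 2000

# Assumption: warmup length is the warmup ratio times the total step count.
WARMUP_STEPS = round(WARMUP_RATIO * TOTAL_STEPS)  # 80 steps


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step (hypothetical helper)."""
    if step < WARMUP_STEPS:
        # Assumption: linear warmup from 0 up to the peak rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Assumption: cosine decay from the peak down to 0 at the final step.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print the rate at a few representative steps.
    for step in (0, 40, 80, 1040, 2000):
        print(f"step {step:4d}: lr = {lr_at(step):.2e}")
```

Under these assumptions the rate climbs for the first 80 steps, peaks at 3.0e-4, and falls to zero by step 2000. The paper's actual schedule may use a different warmup shape or a nonzero floor.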