Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression

Authors: Chenlong Deng, Zhisong Zhang, Kelong Mao, Shuaiyi Li, Tianqing Fang, Hongming Zhang, Haitao Mi, Dong Yu, Zhicheng Dou

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments across multiple long-context tasks demonstrate that UniGist significantly improves compression quality, with especially strong performance in detail-recalling tasks and long-range dependency modeling. ... 5 Experiments ... Table 1 presents the overall evaluation results. We highlight three key observations: (1) UniGist achieves the best performance among all compression methods across both model sizes.
Researcher Affiliation	Collaboration	Chenlong Deng (1), Zhisong Zhang (3), Kelong Mao (1), Shuaiyi Li (2), Tianqing Fang (2), Hongming Zhang (2), Haitao Mi (2), Dong Yu (2), Zhicheng Dou (1). (1) Renmin University of China; (2) Tencent AI Lab; (3) City University of Hong Kong. EMAIL, EMAIL
Pseudocode	No	The paper describes methods and processes through textual descriptions and figures (e.g., Figure 2, Figure 3), but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code	No	Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: we plan to release our code in the future.
Open Datasets	Yes	Continued pretraining is conducted on 16B tokens of 32K-length samples drawn from ProLong's [14] mixed dataset... HELMET benchmark [48]... InfBench [55] and RULER [19]... MMLU-Pro [39] (knowledge and reasoning), GSM8K [7] (math), and HellaSwag [52] (commonsense inference)... Magpie-Llama-3.1-Pro-MT-300K-Filtered dataset [45] as the main source. Then, we refer to the Beacon's [54] setup to further augment it with samples from LongAlpaca [5], BookSum [21], and 500 RULER-like synthetic data designed for long-context tasks.
Dataset Splits	No	The paper mentions restructuring a 64K-length dataset into 32K segments for continued pretraining and uses various benchmarks (e.g., HELMET, MMLU-Pro) that typically have predefined splits. However, it does not explicitly state the train/test/validation splits (e.g., in percentages or sample counts) used for the experiments described in the paper.
Hardware Specification	No	The paper mentions evaluating peak GPU memory usage and that 'All custom attention kernels are implemented in Triton.' It also mentions 'a controlled setting with batch size 1, 32 attention heads, and a head dimension of 128' in Appendix C. However, it does not specify exact GPU models (e.g., NVIDIA A100), CPU models, or other specific hardware configurations used for the experiments.
Software Dependencies	No	The paper states, 'All custom attention kernels are implemented in Triton.' and 'All training and inference experiments are conducted using the Huggingface framework.' However, it does not provide specific version numbers for Triton, Huggingface, or any other software dependencies, which are needed to reproduce the software environment.
Experiment Setup	Yes	Implementation Details. We use Llama-3.1-8B-Instruct and Llama-3.2-3B-Instruct as the base models... Continued pretraining is conducted on 16B tokens of 32K-length samples... followed by supervised tuning on 1B tokens... Cross-document masking is applied... Greedy decoding is used... For UniGist, the sink size is set to 128, and the local window corresponds to 128 raw tokens. The main results reported use a compression ratio of 4. ... A.2 Hyper-parameters. For continued pretraining, we use a batch size of 2M tokens and set the learning rate to 1e-5. The learning rate is warmed up linearly from 0 over 256 steps and then decayed to 50% of its peak using cosine scheduling. The AdamW optimizer is used... During fine-tuning, we retain most hyperparameters but reduce the warm-up steps to 128.
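
Illustration: the learning-rate schedule quoted in the Experiment Setup row (linear warm-up from 0 over 256 steps, then cosine decay to 50% of the 1e-5 peak) can be sketched in Python as below. This is not the authors' code. The function name lr_at_step, its parameters, and the total step count are assumptions made for illustration. In particular, the figure of 8,000 steps comes from dividing 16B tokens by a 2M-token batch with both read as decimal quantities, which the paper does not confirm.

```python
import math


def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_steps=256, final_ratio=0.5):
    """Linear warm-up from 0, then cosine decay to final_ratio * peak_lr.

    All names and defaults are illustrative; only the values 1e-5, 256,
    and 50% are taken from the quoted hyper-parameter description.
    """
    if step < warmup_steps:
        # Warm-up phase: learning rate grows linearly from 0 to peak_lr.
        return peak_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine factor goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)


# Assumption: 16B tokens / 2M tokens per batch, both read as decimal units.
TOTAL_STEPS = 16_000_000_000 // 2_000_000

if __name__ == "__main__":
    for s in (0, 128, 256, TOTAL_STEPS // 2, TOTAL_STEPS):
        print(s, lr_at_step(s, TOTAL_STEPS))
```

For the fine-tuning stage, the same sketch would be called with warmup_steps=128; the corresponding total step count is likewise not stated in the quoted text and would have to be assumed.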