Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

AdmTree: Compressing Lengthy Context with Adaptive Semantic Trees

Authors: Yangning Li, Shaoshen Chen, Yinghui Li, Yankai Chen, Hai-Tao Zheng, Hui Wang, Wenhao Jiang, Philip S Yu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments are conducted to validate the effectiveness of AdmTree. In main experiments, AdmTree consistently achieves state-of-the-art performance across five task types and more than ten datasets from LongBench [4], surpassing baseline methods by over 10% while maintaining high inference efficiency.
Researcher Affiliation	Academia	(1) Shenzhen International Graduate School, Tsinghua University; (2) Peng Cheng Laboratory; (3) University of Illinois Chicago; (4) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)
Pseudocode	No	The paper describes the AdmTree framework and its components (Adaptive Leaf Gist Token Construction, Semantic Tree Construction, Tree-based Compression) in detail using text and mathematical formulations. However, it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code	Yes	we will release our code and model in Github with well-documented readme.
Open Datasets	Yes	We evaluate AdmTree's effectiveness and efficiency on LongBench [4]. LongBench is a multitask benchmark assessing LLM comprehension of long contexts. The description for other used dataset can be found in Appendix. In training AdmTree, we use the same dataset as Zhang et al. [58] to ensure fairness. Specifically, pre-training is conducted on 1B tokens sampled from RedPajama [53]. During fine-tuning, we utilize LongAlpaca [9], BookSum [31], and 16K synthetic samples generated by GPT-3.5 [18].
Dataset Splits	No	The paper mentions using well-known public datasets and benchmarks for evaluation, such as LongBench, Needle-In-The-Haystack, ShareGPT (a curated subset as test set), Arxiv-March23 (a curated subset of 500 samples as test set), and MSC. For training, it states "pre-training is conducted on 1B tokens sampled from RedPajama [53]. During fine-tuning, we utilize LongAlpaca [9], BookSum [31], and 16K synthetic samples generated by GPT-3.5 [18]." However, it does not provide specific percentages or counts for training, validation, and test splits for these training datasets to allow for direct reproducibility of their experimental data partitioning.
Hardware Specification	Yes	AdmTree is trained with a batch size of 8 on 8 NVIDIA A800 GPUs, with learning rates set to 5e-5 for pre-training and 1e-5 for fine-tuning, respectively.
Software Dependencies	No	The paper mentions using "LLaMA-2-7B-Chat and Qwen-2-7B-Instruct as the backbone LLMs." However, it does not specify versions for any other key software components, libraries (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python).
Experiment Setup	Yes	AdmTree is trained with a batch size of 8 on 8 NVIDIA A800 GPUs, with learning rates set to 5e-5 for pre-training and 1e-5 for fine-tuning, respectively.