Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

The Overthinker's DIET: Cutting Token Calories with DIfficulty-AwarE Training

Authors: Weize Chen, Jiarui Yuan, Jin Tailin, Ning Ding, Huimin Chen, Zhiyuan Liu, Maosong Sun

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experimental results demonstrate that DIET significantly reduces token counts while simultaneously improving reasoning performance. ... 4 Experimental Validation
Researcher Affiliation	Academia	Weize Chen, Jiarui Yuan, Tailin Jin, Ning Ding, Huimin Chen, Zhiyuan Liu, Maosong Sun Tsinghua University EMAIL, EMAIL
Pseudocode	No	The paper describes methods and processes but does not include any clearly labeled pseudocode or algorithm blocks. Figure 1 is an overview diagram, not pseudocode.
Open Source Code	Yes	Our code is available at https://github.com/thunlp/DIET.
Open Datasets	Yes	We use the DeepScaleR dataset (Luo et al., 2025b), featuring high-quality mathematical problems of diverse complexities. ... We assess Pass@1 (P@1) and average response length (Tokens, Tok) on MATH 500 (Hendrycks et al., 2021), AIME 2024, AMC 2023, OlympiadBench (He et al., 2024), and Minerva (Lewkowycz et al., 2022).
Dataset Splits	No	The paper states, "We sample 32 samples for each question in AIME24, and 10 samples for others to estimate the P@1." This describes sampling for evaluation on test benchmarks, but it does not specify explicit training/validation/test splits for the DeepScaleR dataset used for training, nor how the authors partitioned their own data into these sets.
Hardware Specification	Yes	train models on 8 A100 GPUs.
Software Dependencies	No	We use veRL (Sheng et al., 2024) as the training framework, and train models on 8 A100 GPUs. ... During the Inference Scaling evaluation of mathematical problems, we utilized Python's sympy module to ascertain the equivalence of two mathematical formulas in LaTeX format.
Experiment Setup	Yes	For our RL-based methods, in the rollout phase, we set the number of rollouts to 8, with a top-p value of 0.95, a temperature of 0.6, and a maximum response length of 8192 tokens. During the training phase, we set α_base in Eq. (4) to 0.5, half-cycle of Cyclical Compression Pressure to 100, KL loss coefficient to 0.001, the learning rate to 1e-6, and the batch size to 128.
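
The Dataset Splits row quotes the paper's evaluation sampling: 32 samples per question for AIME24 and 10 for the other benchmarks, used to estimate P@1. The Python sketch below shows one common way such an estimate can be computed, as the per-question fraction of correct samples averaged over questions. The quoted text does not spell out the estimator, so this is an assumption, and the function and variable names (`samples_for`, `estimate_pass_at_1`) are hypothetical and not taken from the DIET repository.

```python
from statistics import mean

# Samples per question, as quoted in the Dataset Splits row above.
SAMPLES_PER_QUESTION = {"AIME24": 32}
DEFAULT_SAMPLES = 10  # "10 samples for others"


def samples_for(benchmark: str) -> int:
    """Return how many responses to sample per question for a benchmark."""
    return SAMPLES_PER_QUESTION.get(benchmark, DEFAULT_SAMPLES)


def estimate_pass_at_1(correct_flags: list[list[bool]]) -> float:
    """Estimate Pass@1 from sampled responses.

    correct_flags[i][j] is True if the j-th sampled response to question i
    was graded correct. The estimate is the mean, over questions, of the
    fraction of correct samples (an assumed estimator, see the lead-in).
    """
    if not correct_flags:
        raise ValueError("need at least one question")
    per_question = []
    for flags in correct_flags:
        if not flags:
            raise ValueError("every question needs at least one sample")
        per_question.append(sum(flags) / len(flags))
    return mean(per_question)


if __name__ == "__main__":
    # Toy grading results for two questions with two samples each.
    toy = [[True, False], [True, True]]
    print(samples_for("AIME24"), estimate_pass_at_1(toy))
```

Averaging over more samples per question lowers the variance of the estimate, which is a plausible reason for using 32 samples on a small benchmark such as AIME24; the source does not state the motivation.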