Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

The Overthinker's DIET: Cutting Token Calories with DIfficulty-AwarE Training

Authors: Weize Chen, Jiarui Yuan, Tailin Jin, Ning Ding, Huimin Chen, Zhiyuan Liu, Maosong Sun

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that DIET significantly reduces token counts while simultaneously improving reasoning performance. ... 4 Experimental Validation
Researcher Affiliation | Academia | Weize Chen, Jiarui Yuan, Tailin Jin, Ning Ding, Huimin Chen, Zhiyuan Liu, Maosong Sun. Tsinghua University. EMAIL, EMAIL
Pseudocode | No | The paper describes methods and processes but does not include any clearly labeled pseudocode or algorithm blocks. Figure 1 is an overview diagram, not pseudocode.
Open Source Code | Yes | Our code is available at https://github.com/thunlp/DIET.
Open Datasets | Yes | We use the DeepScaleR dataset (Luo et al., 2025b), featuring high-quality mathematical problems of diverse complexities. ... We assess Pass@1 (P@1) and average response length (Tokens, Tok) on MATH 500 (Hendrycks et al., 2021), AIME 2024, AMC 2023, OlympiadBench (He et al., 2024), and Minerva (Lewkowycz et al., 2022).
Dataset Splits | No | The paper states, "We sample 32 samples for each question in AIME24, and 10 samples for others to estimate the P@1." This describes sampling for evaluation on test benchmarks, but it does not specify explicit training/validation/test splits for the DeepScaleR dataset used for training, nor how the authors' own data was partitioned into these sets.
Hardware Specification | Yes | train models on 8 A100 GPUs.
Software Dependencies | No | We use veRL (Sheng et al., 2024) as the training framework, and train models on 8 A100 GPUs. ... During the Inference Scaling evaluation of mathematical problems, we utilized Python's sympy module to ascertain the equivalence of two mathematical formulas in LaTeX format.
Experiment Setup | Yes | For our RL-based methods, in the rollout phase, we set the number of rollouts to 8, with a top-p value of 0.95, a temperature of 0.6, and a maximum response length of 8192 tokens. During the training phase, we set α_base in Eq. (4) to 0.5, half-cycle of Cyclical Compression Pressure to 100, KL loss coefficient to 0.001, the learning rate to 1e-6, and the batch size to 128.
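
The evaluation protocol quoted under Dataset Splits (32 samples per question for AIME24, 10 for the other benchmarks) can be sketched roughly as below. This is a minimal illustration, not the authors' code: the function names and the benchmark key "AIME24" are hypothetical. It assumes that P@1 is estimated as the per-question fraction of correct samples averaged over questions, and that average response length is the mean token count over all sampled responses.

```python
from statistics import mean

# Samples drawn per question, taken from the quoted evaluation setup.
# The dictionary key and the default name are illustrative assumptions.
SAMPLES_PER_QUESTION = {"AIME24": 32}
DEFAULT_SAMPLES = 10


def samples_for(benchmark: str) -> int:
    """Return how many responses to sample per question for a benchmark."""
    return SAMPLES_PER_QUESTION.get(benchmark, DEFAULT_SAMPLES)


def estimate_pass_at_1(correct_flags: list[list[bool]]) -> float:
    """Average, over questions, of the fraction of correct samples.

    correct_flags[i][j] is True when sample j for question i is correct.
    """
    return mean(sum(flags) / len(flags) for flags in correct_flags)


def average_tokens(token_counts: list[list[int]]) -> float:
    """Mean response length over every sampled response."""
    return mean(n for per_question in token_counts for n in per_question)
```

For example, two questions with correctness flags `[[True, False], [True, True]]` have per-question accuracies of 0.5 and 1.0, so the estimate is 0.75.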