Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Activation Control for Efficiently Eliciting Long Chain-of-thought Ability of Language Models

Authors: Zekai Zhao, Qi Liu, Kun Zhou, Zihan Liu, Yifei Shao, Zhiting Hu, Biwei Huang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments have verified the effectiveness of our methods in efficiently eliciting the long CoT ability of LLMs and improving the performance. Besides, we further propose a parameter-efficient fine-tuning method that trains only the last-layer activation amplification module and a few LoRA layers, outperforming LoRA on reasoning benchmarks with much fewer parameters. Our code and data are fully public released at https://github.com/ZekaiZ123/EELo-CoT.
Researcher Affiliation	Academia	Zekai Zhao, Qi Liu, Kun Zhou, Zihan Liu, Yifei Shao, Zhiting Hu, Biwei Huang. University of California, San Diego. EMAIL
Pseudocode	No	The paper describes methodologies and functions (e.g., f(t) = a b log(t + c)) but does not present them in a structured pseudocode or algorithm block format. Procedures are explained in narrative text.
Open Source Code	Yes	Our code and data are fully public released at https://github.com/ZekaiZ123/EELo-CoT.
Open Datasets	Yes	Datasets. We select the following three benchmarks for evaluation: MATH [15]: it consists of 500 high-school level math problems across algebra, geometry, calculus, and number theory. AMC23: it consists of problems from the 2023 American Mathematics Competitions (AMC 10 and AMC 12), covering challenging multi-choice problems designed for high school students. GPQA-Diamond [18]: GPQA benchmark focuses on high-complexity questions. We select the Diamond split that includes only the most difficult examples. LIMO dataset [20].
Dataset Splits	Yes	We visualize the accuracy and self-reflection rate in the test set of MATH dataset. For Math500 and GPQA, we evaluate correctness via greedy decoding with a single sample per question. For AMC23, we generate 16 samples per question with a temperature of 0.7 and compute the unbiased pass@1 metric as proposed by [21].
Hardware Specification	Yes	Our fine-tuning process completed within 8 hours on 8 NVIDIA A100 GPUs.
Software Dependencies	No	All evaluations are accelerated using vLLM [22] for efficient inference.
Experiment Setup	Yes	For intervention, we use the following amplification factors, i.e., 1.2, 1.4, and 1.6, and larger factors would cause the generation unstable. We set the minimum number of digits k in the last sentence to insert wait token in the starting position of the next sentence as 5. The cool-down window setting that temporarily locks down the forcing reflection is set as 4. The number of activations we used are 150 with amplification factor set as 4. For the LoRA baseline, we set the rank to 256 and the scaling factor α to 512, applying LoRA to all eligible layers in the model. In contrast, our method adopts a more parameter-efficient design by using a lower rank of 64 on the first 63 decoder layers. Additionally, we inject an Activation Amplification Module into the final MLP layer. All original model parameters, except parameters in LoRA and Amplification Module are frozen. The number of amplified key activations, n, is set to 100. For AMC23, we generate 16 samples per question with a temperature of 0.7 and compute the unbiased pass@1 metric as proposed by [21].
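
Illustrative note on the AMC23 metric: the table quotes an "unbiased pass@1" computed from 16 samples per question. The sketch below assumes that [21] refers to the commonly used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples and c the number of correct ones; that identification is an assumption, not something stated on this page. The function names pass_at_k and mean_pass_at_k and the example counts are hypothetical and are not taken from the paper's code.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one question.

    n: samples generated, c: samples judged correct, k: budget.
    Assumed form: 1 - C(n - c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts: list[int], n: int, k: int = 1) -> float:
    """Average the per-question estimate over a benchmark."""
    if not correct_counts:
        raise ValueError("correct_counts must be non-empty")
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


if __name__ == "__main__":
    # Hypothetical counts of correct samples for three questions, n = 16 each.
    print(mean_pass_at_k([4, 8, 16], n=16, k=1))
```

For k = 1 the estimator reduces to c / n, so under this assumption the reported pass@1 is the fraction of correct samples per question, averaged over questions. Sampling 16 times at temperature 0.7 lowers the variance of that estimate compared with a single sample.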