Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

System-1.5 Reasoning: Traversal in Language and Latent Spaces with Dynamic Shortcuts

Authors: Xiaoqiang Wang, Suyuchen Wang, Yun Zhu, Bang Liu

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on reasoning tasks demonstrate the superior performance of our method. For example, on GSM8K, System-1.5 Reasoning achieves reasoning performance comparable to traditional CoT fine-tuning methods while accelerating inference by over 20× and reducing token generation by 91.0% on average.
Researcher Affiliation	Academia	Xiaoqiang Wang1,2, Suyuchen Wang1,2, Yun Zhu, Bang Liu1,2,3; 1DIRO & Institut Courtois, Université de Montréal; 2Mila Quebec AI Institute; 3Canada CIFAR AI Chair
Pseudocode	No	The paper describes the methodology and training process using textual descriptions and mathematical equations (e.g., Eq. 1-13) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	No	We provide all necessary details regarding the experimental setup, implementation, and hyperparameters in Section 3 and Appendix B to support reproducibility for reviewer assessment. While the code and data are not publicly available at this stage, we plan to release them upon acceptance.
Open Datasets	Yes	We evaluate the effectiveness and efficiency of System-1.5 Reasoning on two reasoning-intensive tasks: mathematical reasoning and common sense reasoning. For mathematical reasoning, we train on the augmented GSM8K dataset (Deng et al., 2023), which extends the original GSM8K (Cobbe et al., 2021) with a larger set of grade school-level math problems. ... Additionally, for mathematical reasoning, we conduct out-of-domain evaluation on GSM-HARD (Gao et al., 2023), a dataset with increased reasoning difficulty ... For commonsense reasoning, we use StrategyQA (Geva et al., 2021)...
Dataset Splits	Yes	We train models on the official training splits and evaluate performance on the respective test sets for in-domain evaluation. Additionally, for mathematical reasoning, we conduct out-of-domain evaluation on GSM-HARD (Gao et al., 2023), a dataset with increased reasoning difficulty designed to test generalization beyond the original GSM8K. ... For both tasks, we train System-1.5 Reasoning on the respective training splits and evaluate performance on the corresponding test sets to assess in-domain effectiveness.
Hardware Specification	Yes	Training is performed on a single NVIDIA RTX A5000 (24 GB) GPU, requiring approximately 26 hours for LLaMA 3.2 1B and 5 hours for GPT-2 (124M) for an 8-epoch run.
Software Dependencies	No	We developed our method using PyTorch (Paszke et al., 2019). The base models GPT-2 124M (Radford et al., 2019) and LLaMA 3.1-1B (Grattafiori et al., 2024) are initialized from pretrained checkpoints provided by the Hugging Face Transformers library (Wolf et al., 2020).
Experiment Setup	Yes	During shortcut learning, we insert a router-adapter module into each Transformer layer. The router module is implemented as a feed-forward network (FFN) followed by a sigmoid activation. The adapter module is implemented using LoRA (Hu et al., 2021), with a scaling factor α = 32, rank r = 8, and a dropout rate of 0.1. Unless otherwise noted in our test-time scaling analysis (Section 3.2), we set the default depth exit threshold λdepth to 0.6 and the decoding step count λstep to 2. We set the loss coefficient for the language-to-latent alignment (Eq. 9) to α = 1.0, and similarly, the loss coefficient for shortcut learning (Eq. 13) to β = 1.0. Fine-tuning is conducted for 8 epochs using the AdamW optimizer (Loshchilov & Hutter, 2018), with a maximum learning rate of 2 × 10⁻⁵, β1 = 0.9, β2 = 0.99, and a warmup over 6% of total training steps. We use a batch size of 2. Training is performed on a single NVIDIA RTX A5000 (24 GB) GPU, requiring approximately 26 hours for LLaMA 3.2 1B and 5 hours for GPT-2 (124M) for an 8-epoch run. All experiments are conducted over four independent runs with different random seeds. We report the average results across these runs to ensure statistical stability.
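
For readers who want the quoted hyperparameters in one place, the Experiment Setup row can be collected into a small configuration object. This is a minimal sketch and not the authors' code, which the page lists as unreleased. The names `FineTuneConfig` and `learning_rate` are hypothetical. The linear warmup and linear decay shapes are assumptions: the excerpt states only a maximum learning rate and a warmup over 6% of training steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    lora_rank: int = 8                  # LoRA rank r
    lora_alpha: int = 32                # LoRA scaling factor
    lora_dropout: float = 0.1
    depth_exit_threshold: float = 0.6   # lambda_depth
    decoding_step_count: int = 2        # lambda_step
    align_loss_coef: float = 1.0        # alpha in Eq. 9
    shortcut_loss_coef: float = 1.0     # beta in Eq. 13
    epochs: int = 8
    batch_size: int = 2
    max_lr: float = 2e-5
    adam_beta1: float = 0.9
    adam_beta2: float = 0.99
    warmup_fraction: float = 0.06       # warmup over 6% of total steps
    num_seeds: int = 4                  # four independent runs


def learning_rate(step: int, total_steps: int, cfg: FineTuneConfig) -> float:
    """Return the learning rate at a zero-indexed optimizer step.

    ASSUMPTION: linear warmup to cfg.max_lr, then linear decay to zero.
    The source gives only the peak rate and the warmup fraction.
    """
    warmup_steps = max(1, round(cfg.warmup_fraction * total_steps))
    if step < warmup_steps:
        # Ramp up so the last warmup step reaches the peak rate.
        return cfg.max_lr * (step + 1) / warmup_steps
    remaining = max(1, total_steps - warmup_steps)
    return cfg.max_lr * max(0.0, (total_steps - step) / remaining)
```

With `total_steps=1000`, the warmup covers the first 60 steps, after which the rate falls linearly from its peak. If a constant or cosine schedule was used instead, only the body of `learning_rate` would need to change.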