Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Recursive Transformer: Boosting Reasoning Ability with State Stack

Authors: Kechi Zhang, Ge Li, Huangzhao Zhang, Yihong Dong, Jia Li, Jingjing Xu, Zhi Jin

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our comprehensive evaluation spans benchmarks for both Chomsky hierarchy and large-scale natural languages. Across these diverse tasks, STACKTRANS consistently outperforms standard Transformer models and other baselines. We have successfully scaled STACKTRANS up from 360M to 7B parameters. In particular, our from-scratch pretrained model STACKTRANS-360M outperforms several larger open-source LLMs with 2–3× more parameters, showcasing its superior efficiency and reasoning capability. We conduct comprehensive experiments on multiple benchmarks spanning both formal languages [Delétang et al., 2022] and natural languages [Groeneveld et al., 2024]. Evaluation Results: Table 1 shows that STACKTRANS consistently outperforms the standard Transformer, particularly on RE and DCF tasks. We evaluate all the variants introduced above on the V2 and V3 validation sets [Zhu et al., 2024]. Experimental details are provided in G. The ablation results are presented in Table 3.
Researcher Affiliation	Collaboration	(1) Key Lab of High Confidence Software Technology (PKU), Ministry of Education; (2) School of Computer Science, Peking University, China; (3) School of Computer Science, Wuhan University, China; (4) College of AI, Tsinghua University; (5) ByteDance
Pseudocode	No	The paper describes the proposed methods using detailed textual explanations and mathematical equations (e.g., Equations 1-4) along with an architectural diagram (Figure 1a and Figure 2). However, it does not include a distinct section or figure explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code	Yes	In particular, our open-sourced STACKTRANS-360M, which is pretrained on a corpus of approximately 1T tokens, performs better than or comparably to state-of-the-art LLMs with 2–3× more parameters, as shown in Figure 1(c). We will provide the key code for dataset processing and training in the supplemental material.
Open Datasets	Yes	Our corpora come from Dolma [Soldaini et al., 2024] and Smoll [Allal et al., 2025], which contain high-quality natural language, math, and Python code examples with diverse domains. We follow the OLMo framework [Allen AI, 2024] to pretrain STACKTRANS. To assess the downstream capabilities, we evaluate STACKTRANS-360M on a comprehensive suite of widely-used benchmarks, and details are shown in E. As listed in Table 2, STACKTRANS-360M outperforms all baseline models, including those with significantly larger parameter sizes. Notably, it achieves substantial gains on GSM8K and ARC, highlighting its strength in reasoning tasks that require compositional generalization, recursion, or latent state management.
Dataset Splits	Yes	STACKTRANS is trained on sequences with input length uniformly sampled from 1 to 40 tokens. At test time, we evaluate STACKTRANS on sequences with significantly longer lengths up to 500 tokens, thereby measuring its length generalization. Following the same procedure as Delétang et al. [2022], token-level accuracy is used as the evaluation metric.
Hardware Specification	No	To keep a fair comparison, all comparisons are conducted on the same hardware setup. We measure both training and inference time over 100 consecutive steps under identical hyperparameter and batch size configurations. Constrained by computational resources, we limit our final pre-trained model to 360M parameters and use approximately 1 trillion training tokens.
Software Dependencies	No	We follow the OLMo framework [Allen AI, 2024] to pretrain STACKTRANS. We use the lighteval framework [huggingface, 2024], and for all applicable tasks, we adhere to zero-shot evaluation settings, unless otherwise specified.
Experiment Setup	Yes	Concretely, we use five Transformer layers with d = 64. H is set to 4 and $d_s$ is set to 8. We train STACKTRANS models with a range of parameter sizes (360M, 600M, 1.0B, 1.5B, and 7B) under the same training budget in terms of tokens. We pre-train STACKTRANS-360M from scratch, and the detailed model configuration is shown in F. Table 6: Model configuration of STACKTRANS-360M (Vocabulary Size 49152, Number of Attention Heads 15, Number of Hidden Layers 32, Hidden Size 960, Intermediate Size (FFN) 2560, Attention Dropout 0.0, Activation Function Silu, Number of Stack Heads 4, Stack Dimensionality 16, Stack Size 24, Maximum Position Embeddings 4096, RoPE Scaling None, RoPE θ 100000). The overall loss function combines the language modeling loss and the stack regularization term, $L = L_{\mathrm{LM}} + \lambda L_{\mathrm{St}}$, where λ is a hyperparameter. As a regularization term, we give λ a small weight in experiments, e.g., 0.001.
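
The quoted Dataset Splits and Experiment Setup rows describe three mechanical pieces: training lengths sampled uniformly from 1 to 40, token-level accuracy as the metric, and a total loss that adds a small multiple of the stack regularization term to the language modeling loss. The Python sketch below illustrates those three pieces only. It is not the authors' code: the function and constant names are hypothetical, the loss values are plain floats supplied by the caller, and how the stack regularization term itself is computed is not covered here.

```python
import random

# Values taken from the quoted rows above; the names are illustrative only.
TRAIN_MAX_LEN = 40     # training lengths are sampled from 1 to 40 tokens
LAMBDA_STACK = 0.001   # example weight quoted for the regularization term


def sample_train_length(rng):
    """Draw a training sequence length uniformly from 1..TRAIN_MAX_LEN inclusive."""
    return rng.randint(1, TRAIN_MAX_LEN)


def token_accuracy(predictions, targets):
    """Return the fraction of positions where prediction and target agree."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        raise ValueError("cannot score an empty sequence")
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


def total_loss(lm_loss, stack_loss, lam=LAMBDA_STACK):
    """Combine the two terms as L = L_LM + lambda * L_St."""
    return lm_loss + lam * stack_loss


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    print("sampled training length:", sample_train_length(rng))
    print("token accuracy:", token_accuracy([1, 2, 3, 4], [1, 2, 0, 4]))
    print("total loss:", total_loss(2.0, 10.0))
```

With λ = 0.001, a stack term of 10.0 shifts a language modeling loss of 2.0 by only 0.01, which matches the quoted description of λ as a small weight.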