Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Reinforced Context Order Recovery for Adaptive Reasoning and Planning
Authors: Long Ma, Fangwei Zhong, Yizhou Wang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 Experiments In this section, we aim to answer the following questions with our experiments: 1) Can ReCOR solve arithmetic problems without special data preprocessing? 2) Can ReCOR solve reasoning and planning problems adaptively without annotations? 3) Do we need adaptive orders during training or for inference only? 4) How does ReCOR compare with the state-of-the-art methods under fair inference compute settings? 5) Can the performance of ReCOR scale with more compute? Table 1: Performance of ReCOR and baselines on arithmetic datasets. ReCOR outperforms baselines and is competitive with the oracle. Figure 3: ReCOR's performance when scaling the number of token queries K (a) and order queries C (b). Main Stream in (b) denotes using the main stream outputs without a separate order query stream. ReCOR can improve its performance with more computation during training and inference. |
| Researcher Affiliation | Academia | Long Ma Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China EMAIL Fangwei Zhong School of Artificial Intelligence, Beijing Normal University, Beijing, China EMAIL Yizhou Wang School of Computer Science, Institute for Artificial Intelligence, State Key Laboratory of General Artificial Intelligence, Peking University, Beijing, China EMAIL |
| Pseudocode | Yes | B Pseudocode of Training and Inference Algorithms for ReCOR Algorithm 1 Training of ReCOR. Algorithm 2 Inference of ReCOR. |
| Open Source Code | No | 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We will release our code and data upon acceptance. |
| Open Datasets | Yes | We use the Sudoku dataset from [10]. The prompt is a string of length 81 that contains the flattened initial configuration, where each element is either in [1, 9] denoting a given cell, or 0 indicating that the value in this cell is missing. The response is also a string of length 81 that contains the full solution. N = M = 81. The training set contains approximately 1.8 × 10^6 samples while the test set contains 10^5 instances. We also use the Zebra dataset from [10]. The prompt for Zebra puzzles consists of a set of clues; we tokenize the special words in the clues with corresponding special tokens. After tokenization, N = 455, M = 42. The training set contains about 1.5 × 10^6 puzzles while the test set contains 10^5. [10] Kulin Shah, Nishanth Dikkala, Xin Wang, and Rina Panigrahy. Causal language modeling can elicit search and reasoning capabilities on logic puzzles. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. |
| Dataset Splits | Yes | For all datasets, we randomly split a small validation set of size 448 from the training set for validation purposes. We generate a synthetic Autoregression (ARG) task... The size of the training set is 10^6 while the size of the test set is 10^3. We generate a multiplication dataset... We generate a training set of size 10^5 and a test set of size 10^3. We use the Sudoku dataset from [10]... The training set contains approximately 1.8 × 10^6 samples while the test set contains 10^5 instances. We also use the Zebra dataset from [10]... The training set contains about 1.5 × 10^6 puzzles while the test set contains 10^5. |
| Hardware Specification | Yes | Each experiment takes a couple of hours using a single NVIDIA RTX4090 on Autoregression and Multiplication, and less than 2 days using a single NVIDIA A100-80G on Sudoku and Zebra. |
| Software Dependencies | No | The paper mentions mixed-precision training with bfloat16 and float32 for model parameters but does not provide specific software names or version numbers for libraries or frameworks used, which are required for a reproducible description of ancillary software. |
| Experiment Setup | Yes | D.2.2 Hyperparameters We list hyperparameters on Autoregression and Multiplication in Tab. 5 while describing ReCOR-related ones on all datasets in Tab. 6. Baseline performances for Sudoku and Zebra are reported by [12]. In Autoregression and Multiplication, compared with ReCOR, we double the batch size and number of epochs for baselines to match the amount of compute per iteration and number of gradient steps of ReCOR to ensure a fair comparison. |
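
The Sudoku data format quoted under Open Datasets (an 81-character prompt with 0 marking a missing cell, and an 81-character response holding the full solution) can be illustrated with a small sketch. This is not the authors' code: the function names `parse_grid` and `is_consistent` are hypothetical, and the check only covers the string format described above, not the Sudoku row, column, and box constraints.

```python
def parse_grid(s: str) -> list[list[int]]:
    """Turn an 81-character digit string into a 9x9 grid (0 = missing cell)."""
    if len(s) != 81 or any(c not in "0123456789" for c in s):
        raise ValueError("expected a string of exactly 81 digits")
    # Row r occupies characters 9*r .. 9*r + 8 of the flattened string.
    return [[int(c) for c in s[r * 9:(r + 1) * 9]] for r in range(9)]


def is_consistent(prompt: str, response: str) -> bool:
    """Check that the response is fully filled and keeps every given cell.

    Assumption: this only validates the prompt/response format, not
    whether the response is a valid Sudoku solution.
    """
    if len(prompt) != 81 or len(response) != 81:
        return False
    if "0" in response:
        return False  # a full solution has no missing cells
    # Every given (non-zero) prompt cell must be unchanged in the response.
    return all(p == "0" or p == r for p, r in zip(prompt, response))
```

A check of this kind is a cheap sanity test to run on model outputs before computing accuracy, since a response that alters a given cell is wrong regardless of the rest of the grid.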