Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Position: General Intelligence Requires Reward-based Pretraining
Authors: Seungwook Han, Jyothish Pari, Samuel J. Gershman, Pulkit Agrawal
ICML 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments with algorithmic tasks in esoteric programming languages reveal that LLM's reasoning overfits to the training data and is limited in its transferability. Our results in Section 3 show that state-of-the-art LLMs struggle to transfer their algorithmic understanding to coding in new programming syntaxes. |
| Researcher Affiliation | Academia | ¹Improbable AI Lab, MIT; ²Department of Psychology and Center for Brain Science, Harvard University. Correspondence to: Seungwook Han <EMAIL>, Jyothish Pari <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Curriculum-Guided Reasoning with External Memory |
| Open Source Code | No | The paper does not provide an explicit statement or a link to source code for the methodology described. It mentions using existing libraries or models like 'Google DeepMind's mctx library' and 'Qwen 1.5B', but not its own implementation code. |
| Open Datasets | Yes | We collect 80,824 professional 9×9 Go game trajectories from online sources such as Go Quest (Go Quest, 2024) and other research archives (Müller, 2024; Brouwer, 2024) |
| Dataset Splits | Yes | We created 100 training examples and 100 test examples. The number of examples used for each language-task evaluation are as follows: Brainf**k Copy: 100, Brainf**k Print: 676, Brainf**k Sort: 100, Befunge Print: 100, Befunge Fibonacci: 1, Befunge Factorial: 15. We train the network for 10 epochs with a batch size of 1024, a learning rate of 10⁻³, and weight decay of 10⁻⁴. |
| Hardware Specification | Yes | This training procedure takes approximately 14 days on 4 A100 GPUs. |
| Software Dependencies | No | The paper mentions using 'Google DeepMind's mctx library (DeepMind, 2024)' but does not specify a version number for this or any other key software components, such as programming languages or frameworks. |
| Experiment Setup | Yes | We finetuned the Qwen/Qwen2.5-1.5B-Instruct model on 100 synthetic examples for 100 epochs using Low-Rank Adaptation (LoRA) (Hu et al., 2021) with rank=256, alpha=32, and a dropout rate of 0.05 applied to the query, key, and value matrices (Q, K, V). The training employed a cosine learning rate schedule with an initial learning rate of 5e-4, a batch size of 64, and 10 warmup steps. ... We train the network for 10 epochs with a batch size of 1024, a learning rate of 10⁻³, and weight decay of 10⁻⁴. ... We use a batch size of 1024, a starting learning rate of 10⁻² (with cosine decay over 200 total iterations), and weight decay of 10⁻⁴. ... The RL hyperparameters included a batch size of 36, a single PPO epoch per iteration, and a KL coefficient of 0.5. |
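
The LoRA finetuning hyperparameters quoted in the Experiment Setup row can be collected into a small configuration, as a minimal sketch rather than the authors' code. The class name `LoRAFinetuneConfig`, the function `cosine_lr`, and the field names are hypothetical. The sketch also assumes details the quoted text does not state: one optimizer step per batch with the last partial batch kept, linear warmup, decay to zero, and the quoted "initial learning rate" treated as the peak value reached after warmup.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LoRAFinetuneConfig:
    # Values quoted in the Experiment Setup row; field names are illustrative.
    base_model: str = "Qwen/Qwen2.5-1.5B-Instruct"
    num_examples: int = 100
    num_epochs: int = 100
    lora_rank: int = 256
    lora_alpha: int = 32
    lora_dropout: float = 0.05
    target_matrices: tuple = ("Q", "K", "V")
    peak_lr: float = 5e-4
    batch_size: int = 64
    warmup_steps: int = 10

    @property
    def total_steps(self) -> int:
        # Assumption: one optimizer step per batch, last partial batch kept.
        steps_per_epoch = math.ceil(self.num_examples / self.batch_size)
        return steps_per_epoch * self.num_epochs

    @property
    def lora_scale(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank


def cosine_lr(step: int, cfg: LoRAFinetuneConfig) -> float:
    """Assumed shape: linear warmup to peak_lr, then cosine decay toward zero."""
    if step < cfg.warmup_steps:
        return cfg.peak_lr * (step + 1) / cfg.warmup_steps
    decay_steps = max(1, cfg.total_steps - cfg.warmup_steps)
    progress = min(1.0, (step - cfg.warmup_steps) / decay_steps)
    return cfg.peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


cfg = LoRAFinetuneConfig()
schedule = [cosine_lr(step, cfg) for step in range(cfg.total_steps)]
```

Under these assumptions, 100 examples at a batch size of 64 give two steps per epoch, so the 10 warmup steps cover only the first five epochs of the run. A different batching or accumulation scheme would change that proportion, which is why the step count is computed rather than hard-coded.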