Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Algorithmic Capabilities of Random Transformers
Authors: Ziqian Zhong, Jacob Andreas
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments on seven tasks, we find that embedding-only training yields accurate models for a diverse set of problems spanning arithmetic, associative recall, and sequence generation, in some cases substantially outperforming similarly trained recurrent models. Results are shown in Table 1. |
| Researcher Affiliation | Academia | Ziqian Zhong, Jacob Andreas Massachusetts Institute of Technology EMAIL |
| Pseudocode | No | No pseudocode or algorithm block is present in the paper. |
| Open Source Code | Yes | Code is available at https://github.com/fjzzq2002/random_transformers. |
| Open Datasets | Yes | We train models on the TinyStories dataset, a collection of easy-to-understand stories generated by GPT-3.5 and GPT-4 [14]. |
| Dataset Splits | Yes | For the modular addition task, we partition the full set of well-formed input-output pairs into a fixed train/test split; for the other problems, we pre-generate a fixed test set but randomly generate new pairs for each training batch. Additional details may be found in Appendix D.1. For Modular Addition (D.1.1): We randomly shuffle all possible inputs ($p^2$ of them) and perform a 95%:5% split into training and test sets. |
| Hardware Specification | Yes | Roughly 154 GPU days of NVidia V100 were spent on this project. |
| Software Dependencies | No | The paper mentions using 'GPT-2 [38] implementation of Huggingface [48]' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For synthetic experiments, we used AdamW optimizer [27] with a learning rate $10^{-3}$ and weight decay $10^{-3}$. For LSTM a learning rate $5 \times 10^{-3}$ is used for faster convergence. For the language modeling task, we used AdamW optimizer with a learning rate $6 \times 10^{-4}$ and weight decay 0.1. We clip all gradient norms at 1. Modular Addition: 5000 epochs. Batch size 4000. Needle-in-a-Haystack, Decimal Addition, Parentheses Balancing, Circuit Imitation: $10^4$ steps of batch size 1000. Memorization: 21000 epochs. Batch size 215. Language Modeling: 5 epochs. Batch size 20 and context window 512. |
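
The Dataset Splits row describes the modular addition split only in words: enumerate every possible input pair, shuffle, and hold out 5% for testing. The following is a minimal sketch of that procedure, not the authors' implementation. The function name `modular_addition_split`, the fixed `seed`, the rounding of the split point, and the modulus `p=11` are all illustrative assumptions; the table does not state the value of p used in the paper.

```python
import random


def modular_addition_split(p, train_fraction=0.95, seed=0):
    """Enumerate all p**2 input pairs, shuffle them, and split train/test.

    Hypothetical helper: the name, the seed, and the rounding rule are
    assumptions made for illustration, not details from the paper.
    """
    # Every well-formed example: inputs (a, b) and label (a + b) mod p.
    pairs = [((a, b), (a + b) % p) for a in range(p) for b in range(p)]

    # A seeded generator keeps the split fixed across runs (assumed detail).
    rng = random.Random(seed)
    rng.shuffle(pairs)

    # Round the split point down, so the test set takes any remainder.
    n_train = int(train_fraction * len(pairs))
    return pairs[:n_train], pairs[n_train:]


# p = 11 is an arbitrary small modulus chosen only for this example.
train, test = modular_addition_split(p=11)
print(len(train), len(test))
```

Because the split is fixed once and reused, no test pair ever appears in training, which is what distinguishes this task from the others in the table, where new training pairs are sampled for each batch.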