Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Algorithmic Capabilities of Random Transformers

Authors: Ziqian Zhong, Jacob Andreas

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments on seven tasks, we find that embedding-only training yields accurate models for a diverse set of problems spanning arithmetic, associative recall, and sequence generation, in some cases substantially outperforming similarly trained recurrent models. Results are shown in Table 1.
Researcher Affiliation | Academia | Ziqian Zhong and Jacob Andreas, Massachusetts Institute of Technology
Pseudocode | No | No pseudocode or algorithm block is present in the paper.
Open Source Code | Yes | Code is available at https://github.com/fjzzq2002/random_transformers.
Open Datasets | Yes | We train models on the TinyStories dataset, a collection of easy-to-understand stories generated by GPT-3.5 and GPT-4 [14].
Dataset Splits | Yes | For the modular addition task, we partition the full set of well-formed input-output pairs into a fixed train/test split; for the other problems, we pre-generate a fixed test set but randomly generate new pairs for each training batch. Additional details may be found in Appendix D.1. For Modular Addition (D.1.1): we randomly shuffle all possible inputs (p² of them) and perform a 95%/5% split into training and test sets.
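The modular-addition split described above can be sketched as follows. This is a hypothetical reimplementation, not the authors' code: the function name `make_split` is illustrative, and the modulus p = 113 is an example choice rather than a value taken from the paper.

```python
import random

def make_split(p, train_frac=0.95, seed=0):
    """Shuffle all p^2 input pairs (a, b) and split them 95%/5%
    into fixed train/test sets, as described in Appendix D.1.1."""
    pairs = [(a, b) for a in range(p) for b in range(p)]  # all p^2 inputs
    rng = random.Random(seed)          # fixed seed -> reproducible split
    rng.shuffle(pairs)
    cut = int(train_frac * len(pairs))
    return pairs[:cut], pairs[cut:]

train, test = make_split(113)
print(len(train), len(test))
```

Because the split is over the complete enumeration of inputs, train and test are disjoint by construction, so test accuracy measures generalization to genuinely unseen pairs.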
Hardware Specification | Yes | Roughly 154 GPU-days of NVIDIA V100 compute were spent on this project.
Software Dependencies | No | The paper mentions using the 'GPT-2 [38] implementation of Huggingface [48]' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For synthetic experiments, we used the AdamW optimizer [27] with a learning rate of 10⁻³ and weight decay of 10⁻³. For the LSTM, a learning rate of 5×10⁻³ is used for faster convergence. For the language modeling task, we used the AdamW optimizer with a learning rate of 6×10⁻⁴ and weight decay of 0.1. We clip all gradient norms at 1. Modular Addition: 5000 epochs, batch size 4000. Needle-in-a-Haystack, Decimal Addition, Parentheses Balancing, Circuit Imitation: 10⁴ steps with batch size 1000. Memorization: 21000 epochs, batch size 2¹⁵. Language Modeling: 5 epochs, batch size 20, context window 512.
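The hyperparameters quoted above can be consolidated into a single configuration sketch. The dictionary layout and key names below are illustrative assumptions; only the numeric values come from the paper's description.

```python
# Hypothetical consolidation of the reported training settings.
# Structure and names are illustrative; values are from the paper.
CONFIGS = {
    "synthetic": dict(optimizer="AdamW", lr=1e-3, weight_decay=1e-3,
                      grad_clip=1.0),
    "synthetic_lstm": dict(optimizer="AdamW", lr=5e-3, weight_decay=1e-3,
                           grad_clip=1.0),  # higher lr for faster convergence
    "language_modeling": dict(optimizer="AdamW", lr=6e-4, weight_decay=0.1,
                              grad_clip=1.0, epochs=5, batch_size=20,
                              context_window=512),
    "modular_addition": dict(epochs=5000, batch_size=4000),
    "needle_decimal_paren_circuit": dict(steps=10**4, batch_size=1000),
    "memorization": dict(epochs=21000, batch_size=2**15),  # 2^15 = 32768
}

for name, cfg in CONFIGS.items():
    print(name, cfg)
```

Collecting the settings in one place makes the differences stand out: the language modeling run uses a lower learning rate but a much larger weight decay (0.1) than the synthetic tasks (10⁻³), while gradient clipping at norm 1 is shared across all runs.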