Algorithmic Capabilities of Random Transformers

Authors: Ziqian Zhong, Jacob Andreas

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In experiments on seven tasks, we find that embedding-only training yields accurate models for a diverse set of problems spanning arithmetic, associative recall, and sequence generation, in some cases substantially outperforming similarly trained recurrent models. Results are shown in Table 1.
Researcher Affiliation | Academia | Ziqian Zhong, Jacob Andreas, Massachusetts Institute of Technology, {ziqianz, jda}@mit.edu
Pseudocode | No | No pseudocode or algorithm block is present in the paper.
Open Source Code | Yes | Code is available at https://github.com/fjzzq2002/random_transformers.
Open Datasets | Yes | We train models on the TinyStories dataset, a collection of easy-to-understand stories generated by GPT-3.5 and GPT-4 [14].
Dataset Splits | Yes | For the modular addition task, we partition the full set of well-formed input-output pairs into a fixed train/test split; for the other problems, we pre-generate a fixed test set but randomly generate new pairs for each training batch. Additional details may be found in Appendix D.1. For Modular Addition (D.1.1): We randomly shuffle all possible inputs (p^2 of them) and perform a 95%/5% split into training and test sets. (A minimal sketch of this split follows the table.)
Hardware Specification | Yes | Roughly 154 GPU-days of NVIDIA V100 compute were spent on this project.
Software Dependencies | No | The paper mentions using the 'GPT-2 [38] implementation of Huggingface [48]' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For synthetic experiments, we used the AdamW optimizer [27] with learning rate 10^-3 and weight decay 10^-3. For the LSTM, a learning rate of 5×10^-3 is used for faster convergence. For the language modeling task, we used the AdamW optimizer with learning rate 6×10^-4 and weight decay 0.1. We clip all gradient norms at 1. Modular Addition: 5000 epochs, batch size 4000. Needle-in-a-Haystack, Decimal Addition, Parentheses Balancing, Circuit Imitation: 10^4 steps of batch size 1000. Memorization: 21000 epochs, batch size 2^15. Language Modeling: 5 epochs, batch size 20, context window 512.
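To make the Dataset Splits row concrete, here is a minimal Python sketch of the 95%/5% split over all p^2 modular-addition inputs. The modulus p = 59 and the random seed are hypothetical placeholders for illustration only; they are not taken from the paper.

```python
# Minimal sketch of the modular-addition train/test split (95% / 5%).
# Assumptions: p = 59 and seed = 0 are illustrative, not the paper's values.
import random

p = 59     # hypothetical modulus
seed = 0   # hypothetical seed

# Enumerate all p^2 well-formed input-output pairs (a, b, (a + b) mod p).
pairs = [(a, b, (a + b) % p) for a in range(p) for b in range(p)]
random.Random(seed).shuffle(pairs)

split = int(0.95 * len(pairs))                 # 95% train, 5% test
train_set, test_set = pairs[:split], pairs[split:]
print(len(train_set), len(test_set))
```

Similarly, the synthetic-task optimization settings in the Experiment Setup row (AdamW, learning rate 10^-3, weight decay 10^-3, gradient norms clipped at 1) can be sketched as a loose PyTorch loop. The model, data, and loss below are placeholders, not the paper's architecture or training code.

```python
# Minimal sketch of the quoted synthetic-task optimizer setup.
# Placeholder model/data/loss; only the optimizer and clipping settings
# reflect the values quoted from the paper.
import torch

model = torch.nn.Linear(16, 16)  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-3)

for step in range(10):                # illustrative loop, not 10^4 steps
    x = torch.randn(32, 16)           # placeholder batch
    loss = model(x).pow(2).mean()     # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # clip grad norms at 1
    optimizer.step()
```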
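For the language modeling task, the same pattern would apply with learning rate 6×10^-4, weight decay 0.1, and a context window of 512; only the hyperparameter values change, so a separate sketch is omitted.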