Algorithmic Capabilities of Random Transformers
Authors: Ziqian Zhong, Jacob Andreas
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments on seven tasks, we find that embedding-only training yields accurate models for a diverse set of problems spanning arithmetic, associative recall, and sequence generation, in some cases substantially outperforming similarly trained recurrent models. Results are shown in Table 1. |
| Researcher Affiliation | Academia | Ziqian Zhong, Jacob Andreas, Massachusetts Institute of Technology, {ziqianz, jda}@mit.edu |
| Pseudocode | No | No pseudocode or algorithm block is present in the paper. |
| Open Source Code | Yes | Code is available at https://github.com/fjzzq2002/random_transformers. |
| Open Datasets | Yes | We train models on the Tiny Stories dataset, a collection of easy-to-understand stories generated by GPT-3.5 and GPT-4 [14]. |
| Dataset Splits | Yes | For the modular addition task, we partition the full set of well-formed input-output pairs into a fixed train/test split; for the other problems, we pre-generate a fixed test set but randomly generate new pairs for each training batch. Additional details may be found in Appendix D.1. For Modular Addition (D.1.1): We randomly shuffle all possible inputs (p^2 of them) and perform a 95%/5% split into training and test sets. (A split sketch appears after the table.) |
| Hardware Specification | Yes | Roughly 154 GPU-days of NVIDIA V100 were spent on this project. |
| Software Dependencies | No | The paper mentions using 'GPT-2 [38] implementation of Huggingface [48]' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For synthetic experiments, we used the AdamW optimizer [27] with a learning rate of 10^-3 and weight decay of 10^-3. For LSTM, a learning rate of 5×10^-3 is used for faster convergence. For the language modeling task, we used the AdamW optimizer with a learning rate of 6×10^-4 and weight decay of 0.1. We clip all gradient norms at 1. Modular Addition: 5000 epochs, batch size 4000. Needle-in-a-Haystack, Decimal Addition, Parentheses Balancing, Circuit Imitation: 10^4 steps of batch size 1000. Memorization: 21000 epochs, batch size 2^15. Language Modeling: 5 epochs, batch size 20 and context window 512. (An optimizer sketch reflecting these settings appears after the table.) |
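
To make the Dataset Splits row concrete, here is a minimal sketch (assuming PyTorch, not the authors' code) of the modular-addition split described in Appendix D.1.1: enumerate all p^2 input pairs, shuffle them, and split 95%/5% into training and test sets. The modulus `p = 113` and all variable names are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of the 95%/5% modular-addition split (illustrative only).
import torch

p = 113  # assumed modulus for illustration; the paper's exact value may differ

# Enumerate all p^2 well-formed input pairs (a, b) and their answers (a + b) mod p.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))  # shape (p^2, 2)
labels = (pairs[:, 0] + pairs[:, 1]) % p

# Randomly shuffle all inputs, then take 95% for training and 5% for testing.
perm = torch.randperm(p * p)
split = int(0.95 * p * p)
train_idx, test_idx = perm[:split], perm[split:]

train_x, train_y = pairs[train_idx], labels[train_idx]
test_x, test_y = pairs[test_idx], labels[test_idx]
```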
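
A second sketch, continuing from the variables above, applies the optimizer settings reported in the Experiment Setup row (AdamW, learning rate 10^-3, weight decay 10^-3, gradient norms clipped at 1, batch size 4000 for modular addition) to a toy stand-in model with frozen internals and trainable embeddings, mirroring the paper's embedding-only training. The architecture here is an assumption for illustration; the paper uses Hugging Face's GPT-2 implementation.

```python
# Minimal sketch (not the authors' training script) of the reported optimizer
# settings. The toy model below is an assumption: random, frozen transformer
# body with trainable input/output embeddings.
import torch
import torch.nn as nn
from torch.optim import AdamW

class ToyFrozenTransformer(nn.Module):
    """Toy stand-in for a randomly initialized transformer whose internal
    weights are frozen and only the (un)embeddings are trained."""
    def __init__(self, vocab, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.body = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.unembed = nn.Linear(dim, vocab)
        for q in self.body.parameters():   # freeze the random internals
            q.requires_grad = False

    def forward(self, x):                  # x: (batch, seq) of token ids
        h = self.body(self.embed(x))
        return self.unembed(h[:, -1])      # predict from the last position

model = ToyFrozenTransformer(vocab=p)
optimizer = AdamW(
    [q for q in model.parameters() if q.requires_grad],  # embeddings only
    lr=1e-3, weight_decay=1e-3,                          # reported synthetic-task settings
)

# One illustrative epoch over the train split from the previous sketch,
# using the batch size of 4000 reported for modular addition.
for i in range(0, len(train_x), 4000):
    xb, yb = train_x[i:i + 4000], train_y[i:i + 4000]
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(xb), yb)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip norms at 1
    optimizer.step()
```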