Algorithmic Capabilities of Random Transformers
Authors: Ziqian Zhong, Jacob Andreas
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments on seven tasks, we find that embedding-only training yields accurate models for a diverse set of problems spanning arithmetic, associative recall, and sequence generation, in some cases substantially outperforming similarly trained recurrent models. Results are shown in Table 1. |
| Researcher Affiliation | Academia | Ziqian Zhong, Jacob Andreas, Massachusetts Institute of Technology, {ziqianz, jda}@mit.edu |
| Pseudocode | No | No pseudocode or algorithm block is present in the paper. |
| Open Source Code | Yes | Code is available at https://github.com/fjzzq2002/random_transformers. |
| Open Datasets | Yes | We train models on the Tiny Stories dataset, a collection of easy-to-understand stories generated by GPT-3.5 and GPT-4 [14]. |
| Dataset Splits | Yes | For the modular addition task, we partition the full set of well-formed input-output pairs into a fixed train/test split; for the other problems, we pre-generate a fixed test set but randomly generate new pairs for each training batch. Additional details may be found in Appendix D.1. For Modular Addition (D.1.1): We randomly shuffle all possible inputs (p^2 of them) and perform a 95%/5% split into training and test sets. (A split sketch appears after the table.) |
| Hardware Specification | Yes | Roughly 154 GPU-days of NVIDIA V100 were spent on this project. |
| Software Dependencies | No | The paper mentions using 'GPT-2 [38] implementation of Huggingface [48]' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For synthetic experiments, we used the AdamW optimizer [27] with a learning rate of 10^-3 and weight decay of 10^-3. For LSTM, a learning rate of 5×10^-3 is used for faster convergence. For the language modeling task, we used the AdamW optimizer with a learning rate of 6×10^-4 and weight decay of 0.1. We clip all gradient norms at 1. Modular Addition: 5000 epochs, batch size 4000. Needle-in-a-Haystack, Decimal Addition, Parentheses Balancing, Circuit Imitation: 10^4 steps of batch size 1000. Memorization: 21000 epochs, batch size 2^15. Language Modeling: 5 epochs, batch size 20 and context window 512. (An optimizer sketch reflecting these settings appears after the table.) |
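
To make the Dataset Splits row concrete, here is a minimal sketch (assuming PyTorch, not the authors' code) of the modular-addition split described in Appendix D.1.1: enumerate all p^2 input pairs, shuffle them, and split 95%/5% into training and test sets. The modulus `p = 113` and all variable names are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of the 95%/5% modular-addition split (illustrative only).
import torch

p = 113  # assumed modulus for illustration; the paper's exact value may differ

# Enumerate all p^2 well-formed input pairs (a, b) and their answers (a + b) mod p.
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))  # shape (p^2, 2)
labels = (pairs[:, 0] + pairs[:, 1]) % p

# Randomly shuffle all inputs, then take 95% for training and 5% for testing.
perm = torch.randperm(p * p)
split = int(0.95 * p * p)
train_idx, test_idx = perm[:split], perm[split:]

train_x, train_y = pairs[train_idx], labels[train_idx]
test_x, test_y = pairs[test_idx], labels[test_idx]
```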
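
A second sketch, continuing from the variables above, applies the optimizer settings reported in the Experiment Setup row (AdamW, learning rate 10^-3, weight decay 10^-3, gradient norms clipped at 1, batch size 4000 for modular addition) to a toy stand-in model with frozen internals and trainable embeddings, mirroring the paper's embedding-only training. The architecture here is an assumption for illustration; the paper uses Hugging Face's GPT-2 implementation.

```python
# Minimal sketch (not the authors' training script) of the reported optimizer
# settings. The toy model below is an assumption: random, frozen transformer
# body with trainable input/output embeddings.
import torch
import torch.nn as nn
from torch.optim import AdamW

class ToyFrozenTransformer(nn.Module):
    """Toy stand-in for a randomly initialized transformer whose internal
    weights are frozen and only the (un)embeddings are trained."""
    def __init__(self, vocab, dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.body = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.unembed = nn.Linear(dim, vocab)
        for q in self.body.parameters():   # freeze the random internals
            q.requires_grad = False

    def forward(self, x):                  # x: (batch, seq) of token ids
        h = self.body(self.embed(x))
        return self.unembed(h[:, -1])      # predict from the last position

model = ToyFrozenTransformer(vocab=p)
optimizer = AdamW(
    [q for q in model.parameters() if q.requires_grad],  # embeddings only
    lr=1e-3, weight_decay=1e-3,                          # reported synthetic-task settings
)

# One illustrative epoch over the train split from the previous sketch,
# using the batch size of 4000 reported for modular addition.
for i in range(0, len(train_x), 4000):
    xb, yb = train_x[i:i + 4000], train_y[i:i + 4000]
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(xb), yb)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip norms at 1
    optimizer.step()
```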