JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention
Authors: Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, Simon Shaolei Du
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on models trained from real-world datasets (Wikitext2/Wikitext103) and various pre-trained models (OPT, Pythia) verify our theoretical findings. The code is at https://github.com/facebookresearch/luckmatters/tree/yuandong3 |
| Researcher Affiliation | Collaboration | Yuandong Tian, AI@Meta (FAIR), yuandong@meta.com; Yiping Wang, University of Washington, ypwang61@cs.washington.edu; Zhenyu Zhang, University of Texas at Austin, zhenyu.zhang@utexas.edu; Beidi Chen, Carnegie Mellon University and AI@Meta (FAIR), beidic@meta.com, beidic@andrew.cmu.edu; Simon Du, University of Washington, ssdu@cs.washington.edu |
| Pseudocode | No | The paper includes mathematical derivations, equations, and theorems, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The code is at https://github.com/facebookresearch/luckmatters/tree/yuandong3 |
| Open Datasets | Yes | Experiments on models trained from real-world datasets (Wikitext2/Wikitext103 (Merity et al., 2016))... We also tested our hypothesis in OPT (Zhang et al., 2022) (OPT-2.7B) and Pythia (Biderman et al., 2023) (Pythia-70M/1.4B/6.9B) pre-trained models, both of which have public intermediate checkpoints. (See the loading sketch after the table.) |
| Dataset Splits | No | The paper mentions 'yield the best validation score' and 'val_loss', but does not specify the dataset splits (e.g., percentages or sample counts) used for training, validation, and testing. |
| Hardware Specification | No | The paper mentions training models like OPT and Pythia and performing experiments, but it does not specify the hardware used, such as GPU models (e.g., NVIDIA A100, Tesla V100), CPU models, or cloud computing instance types. |
| Software Dependencies | No | The paper mentions 'Adam optimizer is used' but does not specify the versions of any software libraries or frameworks (e.g., PyTorch, TensorFlow, scikit-learn) that were used to implement and run the experiments. |
| Experiment Setup | Yes | We use a 10⁻⁴ learning rate and test our hypothesis on Wikitext2/Wikitext103 (Merity et al., 2016) (top/bottom row). ... Adam optimizer is used with learning rate 10⁻⁵. Vocabulary size M = 100, sequence length T = 30 and embedding dimension d = 1024. (See the training sketch after the table.) |
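Both public resources quoted in the Open Datasets row are retrievable from the Hugging Face hub. Below is a minimal sketch, assuming the `datasets` and `transformers` libraries (which host Wikitext and the Pythia intermediate checkpoints); the specific revision name is one documented example, not taken from the paper.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM

# Wikitext-2 / Wikitext-103 (Merity et al., 2016) from the Hugging Face hub.
wikitext2 = load_dataset("wikitext", "wikitext-2-raw-v1")

# Pythia publishes intermediate training checkpoints as git revisions
# (e.g. "step3000"); OPT checkpoints are likewise hosted on the hub.
pythia = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/pythia-70m", revision="step3000"
)
```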
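The Experiment Setup row fixes the synthetic-experiment hyperparameters (Adam, learning rate 10⁻⁵, M = 100, T = 30, d = 1024) but not the architecture, which lives in the linked repo. A minimal sketch in PyTorch, with a placeholder one-layer transformer and hypothetical synthetic data standing in for the paper's generative model:

```python
import torch
import torch.nn as nn

M, T, d = 100, 30, 1024  # vocab size, sequence length, embedding dim (from the paper)

model = nn.Sequential(   # placeholder architecture; the paper's actual model
    nn.Embedding(M, d),  # is in facebookresearch/luckmatters/tree/yuandong3
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True),
        num_layers=1,
    ),
    nn.Linear(d, M),     # next-token prediction head
)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # per the quoted setup
criterion = nn.CrossEntropyLoss()

# One hypothetical training step on random token sequences (the paper uses
# a structured latent-hierarchy generator instead of uniform noise).
tokens = torch.randint(0, M, (32, T))
logits = model(tokens)                                     # (32, T, M)
loss = criterion(logits[:, :-1].reshape(-1, M), tokens[:, 1:].reshape(-1))
loss.backward()
optimizer.step()
optimizer.zero_grad()
```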