Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention
Authors: Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, Simon Shaolei Du
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on models trained from real-world datasets (Wikitext2/Wikitext103) and various pre-trained models (OPT, Pythia) verify our theoretical findings. The code is at https://github.com/facebookresearch/luckmatters/tree/yuandong3 |
| Researcher Affiliation | Collaboration | Yuandong Tian, AI@Meta (FAIR), EMAIL; Yiping Wang, University of Washington, EMAIL; Zhenyu Zhang, University of Texas at Austin, EMAIL; Beidi Chen, Carnegie Mellon University and AI@Meta (FAIR), EMAIL, EMAIL; Simon Du, University of Washington, EMAIL |
| Pseudocode | No | The paper includes mathematical derivations, equations, and theorems, but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The code is at https://github.com/facebookresearch/luckmatters/tree/yuandong3 |
| Open Datasets | Yes | Experiments on models trained from real-world datasets (Wikitext2/Wikitext103 (Merity et al., 2016))... We also tested our hypothesis in OPT (Zhang et al., 2022) (OPT-2.7B) and Pythia (Biderman et al., 2023) (Pythia-70M/1.4B/6.9B) pre-trained models, both of which have public intermediate checkpoints. |
| Dataset Splits | No | The paper mentions 'yield the best validation score' and 'val_loss', but does not specify the dataset splits (e.g., percentages or sample counts) used for training, validation, and testing. |
| Hardware Specification | No | The paper mentions training models like OPT and Pythia and performing experiments, but it does not specify any particular hardware components such as GPU models (e.g., NVIDIA A100, Tesla V100), CPU models, or cloud computing instance types used for these experiments. |
| Software Dependencies | No | The paper mentions 'Adam optimizer is used' but does not specify the versions of any software libraries or frameworks (e.g., PyTorch, TensorFlow, scikit-learn) that were used to implement and run the experiments. |
| Experiment Setup | Yes | We use a 10^-4 learning rate and test our hypothesis on Wikitext2/Wikitext103 (Merity et al., 2016) (top/bottom row). ... Adam optimizer is used with learning rate 10^-5. Vocabulary size M = 100, sequence length T = 30 and embedding dimension d = 1024. |
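The hyperparameters reported above can be collected into a small configuration sketch. This is a hypothetical illustration assembled from the values quoted in the table (Adam, learning rates 10^-4/10^-5, M = 100, T = 30, d = 1024), not code from the paper's repository; the field names are assumptions.

```python
# Hypothetical configuration sketch based on the hyperparameters
# quoted in the Experiment Setup row; field names are illustrative.
from dataclasses import dataclass

@dataclass
class JoMAExperimentConfig:
    optimizer: str = "Adam"
    lr_realworld: float = 1e-4   # Wikitext2/Wikitext103 runs
    lr_synthetic: float = 1e-5   # synthetic-data runs
    vocab_size: int = 100        # M
    seq_length: int = 30         # T
    embed_dim: int = 1024        # d

config = JoMAExperimentConfig()
print(config.optimizer, config.lr_synthetic, config.embed_dim)
```

Keeping the setup in a dataclass like this makes the reported values easy to check against a reimplementation.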