JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention

Authors: Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, Simon Shaolei Du

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on models trained from real-world dataset (Wikitext2/Wikitext103) and various pre-trained models (OPT, Pythia) verify our theoretical findings. The code is at https://github.com/facebookresearch/luckmatters/tree/yuandong3"
Researcher Affiliation | Collaboration | Yuandong Tian (AI@Meta (FAIR), yuandong@meta.com); Yiping Wang (University of Washington, ypwang61@cs.washington.edu); Zhenyu Zhang (University of Texas at Austin, zhenyu.zhang@utexas.edu); Beidi Chen (Carnegie Mellon University and AI@Meta (FAIR), beidic@meta.com, beidic@andrew.cmu.edu); Simon Du (University of Washington, ssdu@cs.washington.edu)
Pseudocode | No | The paper includes mathematical derivations, equations, and theorems, but no explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | "The code is at https://github.com/facebookresearch/luckmatters/tree/yuandong3"
Open Datasets | Yes | "Experiments on models trained from real-world dataset (Wikitext2/Wikitext103 (Merity et al., 2016))... We also tested our hypothesis in OPT (Zhang et al., 2022) (OPT-2.7B) and Pythia (Biderman et al., 2023) (Pythia-70M/1.4B/6.9B) pre-trained models, both of which have public intermediate checkpoints."
Dataset Splits | No | The paper mentions "yield the best validation score" and "val_loss", but does not specify the dataset splits (e.g., percentages or sample counts) used for training, validation, and testing.
Hardware Specification | No | The paper mentions training models like OPT and Pythia and performing experiments, but it does not specify any particular hardware components such as GPU models (e.g., NVIDIA A100, Tesla V100), CPU models, or cloud computing instance types used for these experiments.
Software Dependencies | No | The paper mentions "Adam optimizer is used" but does not specify the versions of any software libraries or frameworks (e.g., PyTorch, TensorFlow, scikit-learn) that were used to implement and run the experiments.
Experiment Setup | Yes | "We use 10^-4 learning rate and test our hypothesis on Wikitext2/Wikitext103 (Merity et al., 2016) (top/bottom row). ... Adam optimizer is used with learning rate 10^-5. Vocabulary size M = 100, sequence length T = 30 and embedding dimension d = 1024."
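The quoted setup pins down only a handful of values: Adam, a learning rate of 10^-4 (Wikitext runs) or 10^-5, vocabulary size M = 100, sequence length T = 30, and embedding dimension d = 1024. The sketch below shows how those numbers could be wired into a minimal PyTorch training step; the layer count, head count, batch size, and tied output projection are assumptions for illustration, not details taken from the paper or its released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters quoted in the paper's experiment setup.
M, T, d = 100, 30, 1024   # vocabulary size, sequence length, embedding dimension
lr = 1e-5                 # 1e-4 is quoted for the Wikitext2/Wikitext103 runs

# Placeholders: the quoted text does not specify depth or number of heads.
num_layers, num_heads = 1, 1

embed = nn.Embedding(M, d)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=num_heads, batch_first=True),
    num_layers=num_layers,
)
optimizer = torch.optim.Adam(
    list(embed.parameters()) + list(encoder.parameters()), lr=lr
)

# One illustrative next-token-prediction step on random token ids.
tokens = torch.randint(0, M, (8, T))   # batch of 8 sequences of length T (batch size assumed)
hidden = encoder(embed(tokens))        # (8, T, d)
logits = hidden @ embed.weight.T       # tied output projection (assumption), (8, T, M)
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, M),     # predict token t+1 from positions up to t
    tokens[:, 1:].reshape(-1),
)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Anything beyond the five quoted values would have to be taken from the released code at the repository linked above.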