JoMA: Demystifying Multilayer Transformers via Joint Dynamics of MLP and Attention

Authors: Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, Simon Shaolei Du

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on models trained from real-world dataset (Wikitext2/Wikitext103) and various pre-trained models (OPT, Pythia) verify our theoretical findings. The code is at https://github.com/facebookresearch/luckmatters/tree/yuandong3"
Researcher Affiliation | Collaboration | Yuandong Tian (AI@Meta (FAIR), yuandong@meta.com); Yiping Wang (University of Washington, ypwang61@cs.washington.edu); Zhenyu Zhang (University of Texas at Austin, zhenyu.zhang@utexas.edu); Beidi Chen (Carnegie Mellon University and AI@Meta (FAIR), beidic@meta.com, beidic@andrew.cmu.edu); Simon Du (University of Washington, ssdu@cs.washington.edu)
Pseudocode | No | The paper includes mathematical derivations, equations, and theorems, but no explicitly labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | "The code is at https://github.com/facebookresearch/luckmatters/tree/yuandong3"
Open Datasets | Yes | "Experiments on models trained from real-world dataset (Wikitext2/Wikitext103 (Merity et al., 2016))... We also tested our hypothesis in OPT (Zhang et al., 2022) (OPT-2.7B) and Pythia (Biderman et al., 2023) (Pythia-70M/1.4B/6.9B) pre-trained models, both of which have public intermediate checkpoints."
Dataset Splits | No | The paper mentions "yield the best validation score" and "val_loss", but does not specify the dataset splits (e.g., percentages or sample counts) used for training, validation, and testing.
Hardware Specification | No | The paper mentions training models like OPT and Pythia and performing experiments, but it does not specify any particular hardware components such as GPU models (e.g., NVIDIA A100, Tesla V100), CPU models, or cloud computing instance types used for these experiments.
Software Dependencies | No | The paper mentions "Adam optimizer is used" but does not specify the versions of any software libraries or frameworks (e.g., PyTorch, TensorFlow, scikit-learn) that were used to implement and run the experiments.
Experiment Setup | Yes | "We use 10^-4 learning rate and test our hypothesis on Wikitext2/Wikitext103 (Merity et al., 2016) (top/bottom row). ... Adam optimizer is used with learning rate 10^-5. Vocabulary size M = 100, sequence length T = 30 and embedding dimension d = 1024."
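The quoted setup pins down only a handful of values: Adam, a learning rate of 10^-4 (Wikitext runs) or 10^-5, vocabulary size M = 100, sequence length T = 30, and embedding dimension d = 1024. The sketch below shows how those numbers could be wired into a minimal PyTorch training step; the layer count, head count, batch size, and tied output projection are assumptions for illustration, not details taken from the paper or its released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hyperparameters quoted in the paper's experiment setup.
M, T, d = 100, 30, 1024   # vocabulary size, sequence length, embedding dimension
lr = 1e-5                 # 1e-4 is quoted for the Wikitext2/Wikitext103 runs

# Placeholders: the quoted text does not specify depth or number of heads.
num_layers, num_heads = 1, 1

embed = nn.Embedding(M, d)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=num_heads, batch_first=True),
    num_layers=num_layers,
)
optimizer = torch.optim.Adam(
    list(embed.parameters()) + list(encoder.parameters()), lr=lr
)

# One illustrative next-token-prediction step on random token ids.
tokens = torch.randint(0, M, (8, T))   # batch of 8 sequences of length T (batch size assumed)
hidden = encoder(embed(tokens))        # (8, T, d)
logits = hidden @ embed.weight.T       # tied output projection (assumption), (8, T, M)
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, M),     # predict token t+1 from positions up to t
    tokens[:, 1:].reshape(-1),
)

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Anything beyond the five quoted values would have to be taken from the released code at the repository linked above.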