Improving Transformers with Dynamically Composable Multi-Head Attention

Authors: Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement DCMHA/DCFormer and conduct analysis and extensive experiments to evaluate its effectiveness, efficiency and scalability. Experimental results show that DCFormer significantly outperforms Transformer on different architectures (the original or the advanced LLaMA architecture) and model scales (from 405M to 6.9B) in language modeling, matching the performance of models with ~1.7x-2x compute.
Researcher Affiliation | Collaboration | 1 School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing, China; 2 Colorful Clouds Technology Co., Ltd., Beijing, China. Correspondence to: Da Xiao <xiaoda99@bupt.edu.cn>.
Pseudocode | Yes | Appendix D. Pseudo-code for DCMHA (a simplified illustrative sketch is given after this table).
Open Source Code | Yes | The code and models are available at https://github.com/Caiyun-AI/DCFormer.
Open Datasets | Yes | We use the Pile dataset (Gao et al., 2020) for all our language modeling experiments.
Dataset Splits | No | The paper uses the Pile dataset for its language modeling experiments and reports validation loss and perplexity, but the main text does not give explicit train/validation/test splits (e.g., percentages, sample counts, or a citation to predefined splits) needed for reproduction.
Hardware Specification | Yes | We train on TPU v3 pods... The 2.8B and 6.9B models are trained on 256 TPU v3 chips while the 13B and 33B models are trained on 512 chips. For inference we use A100 80G GPU...
Software Dependencies | No | The paper states that the model is implemented and trained in JAX and that inference uses PyTorch (including torch.compile), but it does not give version numbers for these or any other software dependencies, which are needed for a reproducible description.
Experiment Setup | Yes | "Table 3 specifies the model sizes and hyperparameters for scaling experiments. The model architectures, learning rates and batch sizes are mostly taken from GPT-3 specifications (Brown et al., 2020)." and "We use the AdamW optimizer with β1 = 0.9, β2 = 0.95, gradient clip value of 1.0, weight decay of 0.1, 1% learning rate warmup steps followed by cosine decay to 10% of its maximal value, and no dropout."
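
The training setup quoted in the Experiment Setup row maps directly onto optax (the paper reports a JAX training implementation). Below is a minimal sketch, not the authors' code: the peak learning rate and total step count are placeholders, since the real values vary per model size (Table 3 / GPT-3 specs).

```python
import optax

# Placeholders: the paper takes these per model size from GPT-3 specifications.
total_steps = 100_000
peak_lr = 3e-4
warmup_steps = total_steps // 100          # "1% learning rate warmup steps"

# Linear warmup followed by cosine decay to 10% of the peak learning rate.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=peak_lr,
    warmup_steps=warmup_steps,
    decay_steps=total_steps,
    end_value=0.1 * peak_lr,
)

# AdamW with beta1=0.9, beta2=0.95, weight decay 0.1, global gradient clip 1.0.
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(learning_rate=schedule, b1=0.9, b2=0.95, weight_decay=0.1),
)
```

The end_value argument covers the "decay to 10% of its maximal value" clause directly, so no extra schedule wrapper is needed; dropout is simply omitted from the model.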
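
The illustrative sketch referenced in the Pseudocode row: DCMHA composes the per-head attention matrices with a static cross-head mixture plus an input-dependent (dynamic) one. This is a simplified sketch of that idea only, not the paper's Appendix D pseudo-code; the shapes, the tanh low-rank parameterization, and the names compose, w_base, w_q1 and w_q2 are assumptions, and the exact formulation is in the official repository.

```python
import jax.numpy as jnp

def compose(attn, query, w_base, w_q1, w_q2):
    """Simplified cross-head composition of attention matrices.

    attn:   [B, H, T, S]  per-head attention scores (or post-softmax weights)
    query:  [B, T, D]     token representations that condition the dynamic part
    w_base: [H, H]        static cross-head composition matrix
    w_q1:   [D, R]        low-rank down-projection (rank R)
    w_q2:   [R, H * H]    low-rank up-projection to a per-token head-mixing matrix
    """
    B, H, T, S = attn.shape
    # Static composition: each output head is a fixed mixture of input heads.
    static = jnp.einsum('bhts,gh->bgts', attn, w_base)
    # Dynamic composition: a head-mixing matrix generated per token from the query.
    dyn = jnp.tanh(query @ w_q1) @ w_q2           # [B, T, H * H]
    dyn = dyn.reshape(B, T, H, H)                 # [B, T, H_out, H_in]
    dynamic = jnp.einsum('bhts,btgh->bgts', attn, dyn)
    return static + dynamic
```

In the paper this kind of composition is applied to both the pre-softmax attention scores and the post-softmax attention weights; the sketch shows a single application only.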