Improving Transformers with Dynamically Composable Multi-Head Attention

Authors: Da Xiao, Qingye Meng, Shengping Li, Xingyuan Yuan

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement DCMHA/DCFormer and conduct analysis and extensive experiments to evaluate its effectiveness, efficiency and scalability. Experimental results show that DCFormer significantly outperforms Transformer on different architectures (the original or the advanced LLaMA architecture) and model scales (from 405M to 6.9B) in language modeling, matching the performance of models with ~1.7x-2x compute.
Researcher Affiliation | Collaboration | 1 School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing, China; 2 Colorful Clouds Technology Co., Ltd., Beijing, China. Correspondence to: Da Xiao <xiaoda99@bupt.edu.cn>.
Pseudocode | Yes | Appendix D. Pseudo-code for DCMHA (a simplified illustrative sketch is given after this table).
Open Source Code | Yes | The code and models are available at https://github.com/Caiyun-AI/DCFormer.
Open Datasets | Yes | We use the Pile dataset (Gao et al., 2020) for all our language modeling experiments.
Dataset Splits | No | The paper uses the Pile dataset for its language modeling experiments and reports validation loss and perplexity, but the main text does not give explicit train/validation/test splits (e.g., percentages, sample counts, or a citation to predefined splits) needed for reproduction.
Hardware Specification | Yes | We train on TPU v3 pods... The 2.8B and 6.9B models are trained on 256 TPU v3 chips while the 13B and 33B models are trained on 512 chips. For inference we use A100 80G GPU...
Software Dependencies | No | The paper states that the model is implemented and trained in JAX and that inference uses PyTorch (including torch.compile), but it does not give version numbers for these or any other software dependencies, which are needed for a reproducible description.
Experiment Setup | Yes | "Table 3 specifies the model sizes and hyperparameters for scaling experiments. The model architectures, learning rates and batch sizes are mostly taken from GPT-3 specifications (Brown et al., 2020)." and "We use the AdamW optimizer with β1 = 0.9, β2 = 0.95, gradient clip value of 1.0, weight decay of 0.1, 1% learning rate warmup steps followed by cosine decay to 10% of its maximal value, and no dropout."
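
The training setup quoted in the Experiment Setup row maps directly onto optax (the paper reports a JAX training implementation). Below is a minimal sketch, not the authors' code: the peak learning rate and total step count are placeholders, since the real values vary per model size (Table 3 / GPT-3 specs).

```python
import optax

# Placeholders: the paper takes these per model size from GPT-3 specifications.
total_steps = 100_000
peak_lr = 3e-4
warmup_steps = total_steps // 100          # "1% learning rate warmup steps"

# Linear warmup followed by cosine decay to 10% of the peak learning rate.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,
    peak_value=peak_lr,
    warmup_steps=warmup_steps,
    decay_steps=total_steps,
    end_value=0.1 * peak_lr,
)

# AdamW with beta1=0.9, beta2=0.95, weight decay 0.1, global gradient clip 1.0.
optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(learning_rate=schedule, b1=0.9, b2=0.95, weight_decay=0.1),
)
```

The end_value argument covers the "decay to 10% of its maximal value" clause directly, so no extra schedule wrapper is needed; dropout is simply omitted from the model.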
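
The illustrative sketch referenced in the Pseudocode row: DCMHA composes the per-head attention matrices with a static cross-head mixture plus an input-dependent (dynamic) one. This is a simplified sketch of that idea only, not the paper's Appendix D pseudo-code; the shapes, the tanh low-rank parameterization, and the names compose, w_base, w_q1 and w_q2 are assumptions, and the exact formulation is in the official repository.

```python
import jax.numpy as jnp

def compose(attn, query, w_base, w_q1, w_q2):
    """Simplified cross-head composition of attention matrices.

    attn:   [B, H, T, S]  per-head attention scores (or post-softmax weights)
    query:  [B, T, D]     token representations that condition the dynamic part
    w_base: [H, H]        static cross-head composition matrix
    w_q1:   [D, R]        low-rank down-projection (rank R)
    w_q2:   [R, H * H]    low-rank up-projection to a per-token head-mixing matrix
    """
    B, H, T, S = attn.shape
    # Static composition: each output head is a fixed mixture of input heads.
    static = jnp.einsum('bhts,gh->bgts', attn, w_base)
    # Dynamic composition: a head-mixing matrix generated per token from the query.
    dyn = jnp.tanh(query @ w_q1) @ w_q2           # [B, T, H * H]
    dyn = dyn.reshape(B, T, H, H)                 # [B, T, H_out, H_in]
    dynamic = jnp.einsum('bhts,btgh->bgts', attn, dyn)
    return static + dynamic
```

In the paper this kind of composition is applied to both the pre-softmax attention scores and the post-softmax attention weights; the sketch shows a single application only.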