Improving Transformer with an Admixture of Attention Heads

Authors: Tan Nguyen, Tam Nguyen, Hai Do, Khai Nguyen, Vishwanath Saragadam, Minh Pham, Khuong Duy Nguyen, Nhat Ho, Stanley Osher

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we empirically study the advantages of FiSHformer on various tasks and benchmarks, including language modeling on WikiText-103 (Section 3.1), machine translation on IWSLT14 De-En and WMT14 (Section 3.2), image classification on ImageNet (Section 3.3), time series classification on the UEA benchmark (Section 3.4), and reinforcement learning on the D4RL Benchmark (Section 3.5).
Researcher Affiliation | Collaboration | Tan M. Nguyen, Department of Mathematics, University of California, Los Angeles (tanmnguyen89@ucla.edu); Tam Nguyen, FPT Software AI Center (nguyenminhtam9520@gmail.com); Hai Do, FPT Software AI Center (haidn6@fsoft.com.vn); Khai Nguyen, Department of Statistics and Data Sciences, University of Texas at Austin (khainb@utexas.edu); Vishwanath Saragadam, Department of ECE, Rice University (vishwanath.saragadam@rice.edu); Minh Pham, Department of Mathematics, University of California, Los Angeles (minhrose@ucla.edu); Duy Khuong Nguyen, FPT Software AI Center (khuongnd6@fsoft.com.vn); Nhat Ho, Department of Statistics and Data Sciences, University of Texas at Austin (minhnhat@utexas.edu); Stanley J. Osher, Department of Mathematics, University of California, Los Angeles (sjo@math.ucla.edu)
Pseudocode | No | The paper describes methods using text and mathematical equations but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our PyTorch code with documentation can be found at https://github.com/minhtannguyen/FishFormer.
Open Datasets | Yes | We compare the 2 and 4-global-head FiSHformers with the 8-head softmax transformers [75]. Each model has 16 layers, and our training follows the setting from [66]. ... For the IWSLT14 De-En task, we compare 2-global-head (G)FiSHformers with the baseline 4-head softmax transformer. ... For the WMT14 En-De task, ... we compare (G)FiSHformers of 8 and 4 global heads with the 16-head MHA softmax baseline. ... Swin transformer [44], a state-of-the-art vision transformer architecture, for the image classification task on the ImageNet dataset [20]. ... UEA Time Series Classification Archive benchmark [5]. ... D4RL benchmark [29]. (A sketch of the baseline multi-head softmax attention is given after the table.)
Dataset Splits | Yes | Table 1: Perplexity (PPL) on WikiText-103 compared to the baselines. Method, Valid PPL, Test PPL ... Each model has 16 layers, and our training follows the setting from [66]. ... Our training and model setting are the same as those in [53].
Hardware Specification | No | The paper mentions 'GPU memory usage' but does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or cloud instance types used for the experiments. (An environment-logging sketch after the table shows how a reproducer might record this.)
Software Dependencies | No | The paper mentions 'Our PyTorch code with documentation' in Section 3, but does not provide a specific version number for PyTorch or any other software dependency.
Experiment Setup | Yes | All of our results are averaged over 5 runs with different seeds. More details on datasets, models, and training are provided in Appendix A. ... Each model has 16 layers, and our training follows the setting from [66]. ... Our experiments follow the setting on fairseq. ... Our training and model setting are the same as those in [53]. (A seeded multi-run sketch is given below.)
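
The comparisons in the Open Datasets row are against standard multi-head softmax attention (MHA) baselines with fixed head counts (4, 8, or 16 heads). For orientation only, the following is a minimal sketch of such a baseline layer using PyTorch's built-in module; it illustrates the softmax MHA baseline, not FiSHformer's global-head admixture, and the dimensions are illustrative assumptions rather than the paper's configurations.

```python
import torch
import torch.nn as nn

# Standard softmax multi-head attention, as used by the 8-head baseline.
# This is NOT the FiSHformer mechanism; it only illustrates the baseline being compared against.
embed_dim, num_heads = 512, 8                # assumed dimensions, not taken from the paper
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 128, embed_dim)           # (batch, sequence length, embedding dim)
out, attn_weights = mha(x, x, x)             # self-attention: query = key = value = x
print(out.shape, attn_weights.shape)         # (2, 128, 512) and head-averaged weights (2, 128, 128)
```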
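Because the paper reports GPU memory usage without naming the GPUs or pinning a PyTorch version, a reproducer may want to log this information alongside their own runs. A minimal sketch under that assumption; the helper name `log_environment` is ours, not from the paper or its repository.

```python
import torch

def log_environment() -> None:
    """Print library version and GPU details to accompany reported results."""
    print(f"PyTorch version: {torch.__version__}")
    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"GPU model: {torch.cuda.get_device_name(device)}")
        # Peak memory allocated by tensors since the start of the program (bytes).
        print(f"Peak GPU memory allocated: {torch.cuda.max_memory_allocated(device)} bytes")
    else:
        print("CUDA not available; running on CPU.")

if __name__ == "__main__":
    log_environment()
```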
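The paper states that results are averaged over 5 runs with different seeds but does not show the seeding procedure. A minimal sketch of one common way to do this in PyTorch; `run_experiment` is a hypothetical placeholder for a single training-and-evaluation run, and the specific seed values are assumptions.

```python
import random
import statistics

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed Python, NumPy, and PyTorch so a single run is repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def average_over_seeds(run_experiment, seeds=(0, 1, 2, 3, 4)):
    """Run the experiment once per seed and report mean and standard deviation."""
    scores = []
    for seed in seeds:
        set_seed(seed)
        scores.append(run_experiment(seed))  # e.g. returns test perplexity or accuracy
    return statistics.mean(scores), statistics.stdev(scores)
```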