Improving Transformer with an Admixture of Attention Heads
Authors: Tan Nguyen, Tam Nguyen, Hai Do, Khai Nguyen, Vishwanath Saragadam, Minh Pham, Khuong Duy Nguyen, Nhat Ho, Stanley Osher
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we empirically study the advantages of FiSHformer on various tasks and benchmarks, including language modeling on WikiText-103 (Section 3.1), machine translation on IWSLT 14 De-En and WMT 14 (Section 3.2), image classification on ImageNet (Section 3.3), time series classification on the UEA benchmark (Section 3.4), and reinforcement learning on the D4RL Benchmark (Section 3.5). |
| Researcher Affiliation | Collaboration | Tan M. Nguyen, Department of Mathematics, University of California, Los Angeles (tanmnguyen89@ucla.edu); Tam Nguyen, FPT Software AI Center (nguyenminhtam9520@gmail.com); Hai Do, FPT Software AI Center (haidn6@fsoft.com.vn); Khai Nguyen, Department of Statistics and Data Sciences, University of Texas at Austin (khainb@utexas.edu); Vishwanath Saragadam, Department of ECE, Rice University (vishwanath.saragadam@rice.edu); Minh Pham, Department of Mathematics, University of California, Los Angeles (minhrose@ucla.edu); Duy Khuong Nguyen, FPT Software AI Center (khuongnd6@fsoft.com.vn); Nhat Ho, Department of Statistics and Data Sciences, University of Texas at Austin (minhnhat@utexas.edu); Stanley J. Osher, Department of Mathematics, University of California, Los Angeles (sjo@math.ucla.edu) |
| Pseudocode | No | The paper describes methods using text and mathematical equations but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our PyTorch code with documentation can be found at https://github.com/minhtannguyen/FishFormer. |
| Open Datasets | Yes | We compare the 2 and 4-global-head FiSHformers with the 8-head softmax transformers [75]. Each model has 16 layers, and our training follows the setting from [66]. ... For the IWSLT 14 De-En task, we compare 2-global-head (G)FiSHformers with the baseline 4-head softmax transformer. ... For the WMT 14 En-De task, ... we compare (G)FiSHformers of 8 and 4 global heads with the 16-head MHA softmax baseline. ... Swin transformer [44], a state-of-the-art vision transformer architecture, for the image classification task on the ImageNet dataset [20]. ... UEA Time Series Classification Archive benchmark [5]. ... D4RL benchmark [29]. |
| Dataset Splits | Yes | Table 1: Perplexity (PPL) on WikiText-103 compared to the baselines. Method Valid PPL Test PPL ... Each model has 16 layers, and our training follows the setting from [66]. ... Our training and model setting are the same as those in [53]. |
| Hardware Specification | No | The paper mentions 'GPU memory usage' but does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or cloud instance types used for experiments. |
| Software Dependencies | No | The paper mentions 'Our PyTorch code with documentation' in Section 3, but does not provide a specific version number for PyTorch or any other software dependency. |
| Experiment Setup | Yes | All of our results are averaged over 5 runs with different seeds. More details on datasets, models, and training are provided in Appendix A. ... Each model has 16 layers, and our training follows the setting from [66]. ... Our experiments follow the setting on fairseq. ... Our training and model setting are the same as those in [53]. (A seeding-and-averaging sketch follows the table.) |
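
The experiment-setup row reports that results are "averaged over 5 runs with different seeds." The paper does not show the corresponding code, so the sketch below is only an assumption of how such a protocol is commonly wired up in a PyTorch pipeline: `train_and_evaluate` is a hypothetical placeholder (here returning a dummy metric) standing in for one full training-plus-evaluation run of a FiSHformer or baseline model.

```python
# Minimal sketch (not from the paper) of averaging a metric over 5 seeded runs.
import random
import statistics

import numpy as np
import torch


def set_seed(seed: int) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs so one run is reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)


def train_and_evaluate(seed: int) -> float:
    """Hypothetical stand-in for one full training + evaluation run.

    In practice this would build the model, train it with the paper's
    settings, and return the metric of interest (e.g. test perplexity,
    BLEU, or accuracy). Here it returns a dummy value for illustration.
    """
    return 20.0 + random.random()  # dummy metric, illustration only


if __name__ == "__main__":
    seeds = [0, 1, 2, 3, 4]  # five different seeds, as reported in the paper
    scores = [train_and_evaluate(s) for s in (set_seed(s) or s for s in seeds)]
    print(f"mean={statistics.mean(scores):.2f}  stdev={statistics.pstdev(scores):.2f}")
```

In a real pipeline the seed would typically also be passed to the data loaders and any distributed workers; the reported number is then the mean (and often the standard deviation) of the per-seed metrics, as in the sketch above.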