Improving Transformer with an Admixture of Attention Heads

Authors: Tan Nguyen, Tam Nguyen, Hai Do, Khai Nguyen, Vishwanath Saragadam, Minh Pham, Khuong Duy Nguyen, Nhat Ho, Stanley Osher

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we empirically study the advantages of FiSHformer on various tasks and benchmarks, including language modeling on WikiText-103 (Section 3.1), machine translation on IWSLT14 De-En and WMT14 (Section 3.2), image classification on ImageNet (Section 3.3), time series classification on the UEA benchmark (Section 3.4), and reinforcement learning on the D4RL Benchmark (Section 3.5).
Researcher Affiliation | Collaboration | Tan M. Nguyen, Department of Mathematics, University of California, Los Angeles (tanmnguyen89@ucla.edu); Tam Nguyen, FPT Software AI Center (nguyenminhtam9520@gmail.com); Hai Do, FPT Software AI Center (haidn6@fsoft.com.vn); Khai Nguyen, Department of Statistics and Data Sciences, University of Texas at Austin (khainb@utexas.edu); Vishwanath Saragadam, Department of ECE, Rice University (vishwanath.saragadam@rice.edu); Minh Pham, Department of Mathematics, University of California, Los Angeles (minhrose@ucla.edu); Duy Khuong Nguyen, FPT Software AI Center (khuongnd6@fsoft.com.vn); Nhat Ho, Department of Statistics and Data Sciences, University of Texas at Austin (minhnhat@utexas.edu); Stanley J. Osher, Department of Mathematics, University of California, Los Angeles (sjo@math.ucla.edu)
Pseudocode | No | The paper describes methods using text and mathematical equations but does not provide structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our PyTorch code with documentation can be found at https://github.com/minhtannguyen/FishFormer.
Open Datasets | Yes | We compare the 2 and 4-global-head FiSHformers with the 8-head softmax transformers [75]. Each model has 16 layers, and our training follows the setting from [66]. ... For the IWSLT14 De-En task, we compare 2-global-head (G)FiSHformers with the baseline 4-head softmax transformer. ... For the WMT14 En-De task, ... we compare (G)FiSHformers of 8 and 4 global heads with the 16-head MHA softmax baseline. ... Swin transformer [44], a state-of-the-art vision transformer architecture, for the image classification task on the ImageNet dataset [20]. ... UEA Time Series Classification Archive benchmark [5]. ... D4RL benchmark [29]. (A sketch of the baseline multi-head softmax attention is given after the table.)
Dataset Splits | Yes | Table 1: Perplexity (PPL) on WikiText-103 compared to the baselines. Method, Valid PPL, Test PPL ... Each model has 16 layers, and our training follows the setting from [66]. ... Our training and model setting are the same as those in [53].
Hardware Specification | No | The paper mentions 'GPU memory usage' but does not provide specific hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or cloud instance types used for the experiments. (An environment-logging sketch after the table shows how a reproducer might record this.)
Software Dependencies | No | The paper mentions 'Our PyTorch code with documentation' in Section 3, but does not provide a specific version number for PyTorch or any other software dependency.
Experiment Setup | Yes | All of our results are averaged over 5 runs with different seeds. More details on datasets, models, and training are provided in Appendix A. ... Each model has 16 layers, and our training follows the setting from [66]. ... Our experiments follow the setting on fairseq. ... Our training and model setting are the same as those in [53]. (A seeded multi-run sketch is given below.)
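
The comparisons in the Open Datasets row are against standard multi-head softmax attention (MHA) baselines with fixed head counts (4, 8, or 16 heads). For orientation only, the following is a minimal sketch of such a baseline layer using PyTorch's built-in module; it illustrates the softmax MHA baseline, not FiSHformer's global-head admixture, and the dimensions are illustrative assumptions rather than the paper's configurations.

```python
import torch
import torch.nn as nn

# Standard softmax multi-head attention, as used by the 8-head baseline.
# This is NOT the FiSHformer mechanism; it only illustrates the baseline being compared against.
embed_dim, num_heads = 512, 8                # assumed dimensions, not taken from the paper
mha = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

x = torch.randn(2, 128, embed_dim)           # (batch, sequence length, embedding dim)
out, attn_weights = mha(x, x, x)             # self-attention: query = key = value = x
print(out.shape, attn_weights.shape)         # (2, 128, 512) and head-averaged weights (2, 128, 128)
```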
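Because the paper reports GPU memory usage without naming the GPUs or pinning a PyTorch version, a reproducer may want to log this information alongside their own runs. A minimal sketch under that assumption; the helper name `log_environment` is ours, not from the paper or its repository.

```python
import torch

def log_environment() -> None:
    """Print library version and GPU details to accompany reported results."""
    print(f"PyTorch version: {torch.__version__}")
    if torch.cuda.is_available():
        device = torch.cuda.current_device()
        print(f"GPU model: {torch.cuda.get_device_name(device)}")
        # Peak memory allocated by tensors since the start of the program (bytes).
        print(f"Peak GPU memory allocated: {torch.cuda.max_memory_allocated(device)} bytes")
    else:
        print("CUDA not available; running on CPU.")

if __name__ == "__main__":
    log_environment()
```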
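The paper states that results are averaged over 5 runs with different seeds but does not show the seeding procedure. A minimal sketch of one common way to do this in PyTorch; `run_experiment` is a hypothetical placeholder for a single training-and-evaluation run, and the specific seed values are assumptions.

```python
import random
import statistics

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed Python, NumPy, and PyTorch so a single run is repeatable."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def average_over_seeds(run_experiment, seeds=(0, 1, 2, 3, 4)):
    """Run the experiment once per seed and report mean and standard deviation."""
    scores = []
    for seed in seeds:
        set_seed(seed)
        scores.append(run_experiment(seed))  # e.g. returns test perplexity or accuracy
    return statistics.mean(scores), statistics.stdev(scores)
```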