Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals
Authors: Tam Nguyen, Tan Nguyen, Richard Baraniuk
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling. |
| Researcher Affiliation | Academia | Tam Nguyen, Department of Electrical & Computer Engineering, Rice University, Houston, USA, mn72@rice.edu; Tan M. Nguyen, Department of Mathematics, National University of Singapore, Singapore, tanmn@nus.edu.sg; Richard G. Baraniuk, Department of Electrical & Computer Engineering, Rice University, Houston, USA, richb@rice.edu |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code to reproduce our experimental results is included in our Supplementary Material submission. |
| Open Datasets | Yes | The ImageNet dataset [15, 52] comprises 1.28 million training images and 50,000 validation images, encompassing the classification of 1000 categories. The ADE20K dataset is recognized for its inclusion of challenging scenes with fine-grained labels... The WikiText-103 dataset consists of articles extracted from Wikipedia and is specifically designed to capture long contextual dependencies. [42] |
| Dataset Splits | Yes | The ImageNet dataset [15, 52] comprises 1.28 million training images and 50,000 validation images... The training set consists of 20,210 images... Additionally, there are 2,000 images in the validation set and 3,352 images in the test set. ... The validation and test sets contain 218,000 and 246,000 running words, respectively... |
| Hardware Specification | No | The paper does not specify the hardware used to run the experiments (e.g., GPU models, CPU types, or cloud computing instance specifications). |
| Software Dependencies | No | The paper refers to existing codebases and models (e.g., the DeiT baseline, lmtool-fwp) but does not specify versions of the programming languages, libraries, or frameworks used (e.g., Python version, PyTorch/TensorFlow version). |
| Experiment Setup | Yes | Our baseline model is the DeiT-tiny model [59], which consists of 12 transformer layers, 3 attention heads per layer, and a model dimension of 192... The λ used for our NeuTRENO method is 0.6. ... In our experiments, we set the dimensions of keys, values, and queries to 128, while the training and evaluation context length is set to 256. In this experiment, λ = 0.4 yields the best performance of the NeuTRENO language model. |
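
The quoted configuration is concrete enough to sketch in code. Below is a minimal PyTorch sketch of an attention layer with the λ-weighted value-difference correction that NeuTRENO is built around; the class name, the `v_first` plumbing, and the exact form of the correction are illustrative assumptions, not the authors' released implementation (which the paper says is included in the Supplementary Material).

```python
# Hypothetical sketch of a NeuTRENO-style attention layer (not the authors' code).
# Assumption: the anti-over-smoothing term is a lambda-weighted difference
# between first-layer and current-layer value vectors.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


class NeuTRENOAttention(nn.Module):
    def __init__(self, dim: int = 192, num_heads: int = 3, lam: float = 0.6):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.lam = lam  # 0.6 in the DeiT-tiny experiments, 0.4 for language modeling
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, v_first: Optional[torch.Tensor] = None):
        # x: (batch, tokens, dim); v_first: value tensor saved from the first layer
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (batch, heads, tokens, head_dim)

        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = attn @ v

        if v_first is None:
            v_first = v  # first layer: the correction term vanishes
        # Pull token representations back toward the first layer's values,
        # counteracting over-smoothing across depth (assumed form).
        out = out + self.lam * (v_first - v)

        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out), v_first


# Usage with the DeiT-tiny configuration quoted above
# (12 layers, 3 heads per layer, model dimension 192, lambda = 0.6).
layer = NeuTRENOAttention(dim=192, num_heads=3, lam=0.6)
tokens = torch.randn(2, 197, 192)  # e.g., 196 image patches + 1 class token
out, v_first = layer(tokens)
print(out.shape)  # torch.Size([2, 197, 192])
```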