Mitigating Over-smoothing in Transformers via Regularized Nonlocal Functionals

Authors: Tam Nguyen, Tan Nguyen, Richard Baraniuk

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate the advantages of NeuTRENO over the baseline transformers and state-of-the-art methods in reducing the over-smoothing of token representations on various practical tasks, including object classification, image segmentation, and language modeling.
Researcher Affiliation | Academia | Tam Nguyen, Department of Electrical & Computer Engineering, Rice University, Houston, USA (mn72@rice.edu); Tan M. Nguyen, Department of Mathematics, National University of Singapore, Singapore (tanmn@nus.edu.sg); Richard G. Baraniuk, Department of Electrical & Computer Engineering, Rice University, Houston, USA (richb@rice.edu)
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code to reproduce our experimental results is included in our Supplementary Material submission.
Open Datasets | Yes | The ImageNet dataset [15, 52] comprises 1.28 million training images and 50,000 validation images, encompassing the classification of 1000 categories. The ADE20K dataset is recognized for its inclusion of challenging scenes with fine-grained labels... The WikiText-103 dataset consists of articles extracted from Wikipedia and is specifically designed to capture long contextual dependencies. [42]
Dataset Splits | Yes | The ImageNet dataset [15, 52] comprises 1.28 million training images and 50,000 validation images... The training set consists of 20,210 images... Additionally, there are 2,000 images in the validation set and 3,352 images in the test set. ... The validation and test sets contain 218,000 and 246,000 running words, respectively...
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or cloud computing instance specifications) used for running the experiments.
Software Dependencies | No | The paper refers to existing codebases and models (e.g., the DeiT baseline, lmtool-fwp) but does not specify versions of programming languages, libraries, or frameworks used (e.g., Python version, PyTorch/TensorFlow version).
Experiment Setup | Yes | Our baseline model is the DeiT-tiny model [59], which consists of 12 transformer layers, 3 attention heads per layer, and a model dimension of 192... The λ used for our NeuTRENO method is 0.6. ... In our experiments, we set the dimensions of keys, values, and queries to 128, while the training and evaluation context length is set to 256. In this experiment, λ = 0.4 yields the best performance of the NeuTRENO language model.
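
For context on what the λ hyperparameter above controls: in the paper's formulation, NeuTRENO adds a λ-weighted correction term to standard self-attention that pulls each layer's output back toward the first layer's value matrix, counteracting over-smoothing across depth. Below is a minimal PyTorch sketch of that update. The function name, tensor layout, and exact placement of the correction term are assumptions made for illustration; this is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def neutreno_attention(q, k, v, v_first, lam=0.6):
    """Minimal sketch of a NeuTRENO-style self-attention step.

    q, k, v : (batch, heads, seq_len, head_dim) projections at the
              current layer
    v_first : value projection cached from the FIRST layer
              (same shape as v)
    lam     : regularization weight lambda -- 0.6 for DeiT-tiny and
              0.4 for the language model, per the setup quoted above
    """
    d = q.size(-1)
    # Standard scaled dot-product attention.
    scores = q @ k.transpose(-2, -1) / d ** 0.5
    out = F.softmax(scores, dim=-1) @ v
    # NeuTRENO correction (assumed form): nudge the output toward the
    # first layer's values so token representations stay distinct
    # across depth instead of collapsing (over-smoothing).
    return out + lam * (v_first - v)
```

In a full model, every attention layer after the first would receive the cached v_first; everything else matches a vanilla transformer block.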