Outlier-Efficient Hopfield Layers for Large Transformer-Based Models

Authors: Jerry Yao-Chieh Hu, Pei-Hsuan Chang, Haozheng Luo, Hong-Yu Chen, Weijian Li, Wei-Po Wang, Han Liu

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we validate the proposed method on 3 common large transformer-based models and 1 Hopfield-based model (BERT (Devlin et al., 2019), Open Pre-trained Transformer (OPT) (Zhang et al., 2022), Vision Transformer (ViT) (Dosovitskiy et al., 2020), and STanHop-Net (Wu et al., 2024b)). Specifically, OutEffHop reduces average kurtosis and maximum infinity norm by 22+% and 26+%, respectively, improves the same metrics by an average of 3% and 4% compared to 3 variants of STanHop-Net, and ranks among the top two in outlier efficiency in 25 out of 30 settings. (A sketch of the two outlier metrics appears after the table.)
Researcher Affiliation | Academia | ¹Department of Computer Science, Northwestern University, Evanston, USA; ²Department of Physics, National Taiwan University, Taipei, Taiwan; ³Department of Statistics and Data Science, Northwestern University, Evanston, USA.
Pseudocode | No | No pseudocode or algorithm blocks are explicitly presented or labeled in the paper.
Open Source Code | Yes | Code is available on GitHub; future updates are on arXiv.
Open Datasets | Yes | Datasets. We use 4 real-world datasets: Bookcorpus (Zhu et al., 2015), wiki40b/en (Guo et al., 2020), ImageNet-1k (Russakovsky et al., 2015), and ETTh1 (Zhou et al., 2021). The first two are for the language models, i.e., OPT and BERT; the third is for the vision model, i.e., ViT; and the last is for the time-series model, i.e., STanHop-Net.
Dataset Splits | Yes | We then train these models from scratch and evaluate them on the validation set. We divide these datasets into training, validation, and test sets with a ratio of 14/5/5. (A loading-and-split sketch appears after the table.)
Hardware Specification | Yes | Our experimental setup used a Slurm system with two 80GB A100 GPUs and a 24-core Intel(R) Xeon(R) Gold 6338 CPU at 2.00GHz.
Software Dependencies | No | No specific software dependencies with version numbers are listed in the paper's main text.
Experiment Setup | Yes | Table 3 lists the hyperparameters used in the fast convergence task:

    parameter               value
    learning rate           1e-4
    embedding dimension     512
    feed-forward dimension  1024
    dropout                 0.3
    activation function     GELU
    epochs                  100
    batch size              512
    model optimizer         Adam
    patch size              32

(A configuration sketch applying these values appears below.)
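The Research Type row reports reductions in average kurtosis and maximum infinity norm of activations. Below is a minimal sketch of how these two outlier metrics can be computed; it is not the authors' code, and the way activations are collected (e.g., via forward hooks, one tensor per layer) is an assumption.

```python
# Minimal sketch (not the authors' code): average kurtosis and maximum
# infinity norm over a set of activation tensors, e.g., one per layer.
import torch

def outlier_metrics(activations):
    """activations: list of float tensors collected from the model."""
    kurtoses, inf_norms = [], []
    for act in activations:
        x = act.float().flatten()
        mu = x.mean()
        sigma = x.std(unbiased=False)
        # Fourth standardized moment (Pearson kurtosis; equals 3 for a Gaussian).
        kurtoses.append((((x - mu) / sigma) ** 4).mean().item())
        # Infinity norm: the largest absolute activation value.
        inf_norms.append(x.abs().max().item())
    return sum(kurtoses) / len(kurtoses), max(inf_norms)
```

Averaging kurtosis across layers and taking the overall maximum infinity norm matches how the two numbers are summarized in the quote; whether Pearson or excess kurtosis is used is not stated here, so the choice above is an assumption.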
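The Open Datasets and Dataset Splits rows combine into a short loading-and-splitting sketch. The Hugging Face dataset identifier and the use of the `datasets` library are illustrative assumptions; the paper's own loading code is not reproduced here.

```python
# Hypothetical sketch: load one of the text corpora and re-split it
# 14/5/5 (train/validation/test), the ratio reported in the paper.
# The dataset ID "wiki40b"/"en" is an assumed Hugging Face name.
from datasets import load_dataset

ds = load_dataset("wiki40b", "en", split="train").shuffle(seed=0)

# A 14/5/5 ratio means fractions 14/24, 5/24, and 5/24 of the data.
n = len(ds)
n_train = n * 14 // 24
n_val = n * 5 // 24

train_set = ds.select(range(n_train))
val_set = ds.select(range(n_train, n_train + n_val))
test_set = ds.select(range(n_train + n_val, n))
```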
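Finally, the Table 3 values map directly onto a training configuration. The sketch below assumes a PyTorch setup and uses a single `nn.TransformerEncoderLayer` as a stand-in model, since the full architecture is not reproduced here; the head count is not listed in Table 3 and is assumed.

```python
# Table 3's hyperparameters wired into a PyTorch setup. Only the
# constants below come from the paper; the stand-in model and head
# count are illustrative assumptions.
import torch
import torch.nn as nn

LR = 1e-4
EMBED_DIM = 512
FFN_DIM = 1024
DROPOUT = 0.3
EPOCHS = 100
BATCH_SIZE = 512
PATCH_SIZE = 32  # ViT-style patch size

# Stand-in model: one Transformer encoder layer with the paper's
# dimensions, dropout, and GELU activation.
model = nn.TransformerEncoderLayer(
    d_model=EMBED_DIM,
    nhead=8,  # assumed; not reported in Table 3
    dim_feedforward=FFN_DIM,
    dropout=DROPOUT,
    activation="gelu",
    batch_first=True,
)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)
```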