Outlier-Efficient Hopfield Layers for Large Transformer-Based Models
Authors: Jerry Yao-Chieh Hu, Pei-Hsuan Chang, Haozheng Luo, Hong-Yu Chen, Weijian Li, Wei-Po Wang, Han Liu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we validate the proposed method on 3 common large transformer-based models and 1 Hopfield-based model (BERT (Devlin et al., 2019), Open Pre-trained Transformer (OPT) (Zhang et al., 2022), Vision Transformer (ViT) (Dosovitskiy et al., 2020) and STanHop-Net (Wu et al., 2024b)). Specifically, OutEffHop reduces average kurtosis and maximum infinity norm by 22+% and 26+%, respectively, improves the same metrics by an average of 3% and 4% compared to 3 variants of STanHop-Net, and ranks among the top two in outlier efficiency in 25 out of 30 settings (see the metric sketch after the table). |
| Researcher Affiliation | Academia | 1. Department of Computer Science, Northwestern University, Evanston, USA; 2. Department of Physics, National Taiwan University, Taipei, Taiwan; 3. Department of Statistics and Data Science, Northwestern University, Evanston, USA. |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly presented or labeled in the paper. |
| Open Source Code | Yes | Code is available on GitHub; future updates are on arXiv. |
| Open Datasets | Yes | Datasets. We use 4 real-world datasets: BookCorpus (Zhu et al., 2015), wiki40b/en (Guo et al., 2020), ImageNet-1k (Russakovsky et al., 2015) and ETTh1 (Zhou et al., 2021). The first two are for the language models, i.e. OPT and BERT, the third is for the vision model, i.e. ViT, and the last is for the time-series model, i.e. STanHop-Net. |
| Dataset Splits | Yes | We then train these models from scratch and evaluate them on the validation set. We divide these datasets into training, validation, and test sets with a ratio of 14/5/5 (see the split sketch after the table). |
| Hardware Specification | Yes | Our experimental setup used a Slurm system with two 80G A100 GPUs and a 24-core Intel(R) Xeon(R) Gold 6338 CPU at 2.00GHz. |
| Software Dependencies | No | No specific software dependencies with version numbers are listed in the paper's main text. |
| Experiment Setup | Yes | Table 3. Hyperparameters used in the fast convergence task: learning rate 1e-4; embedding dimension 512; feed-forward dimension 1024; dropout 0.3; activation function GELU; epochs 100; batch size 512; optimizer Adam; patch size 32. |
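
The outlier-efficiency numbers in the Research Type row refer to two activation statistics: average kurtosis and maximum infinity norm. Below is a minimal sketch of how such metrics could be computed over per-layer activations; the function name and the choice of Pearson (non-excess) kurtosis are assumptions for illustration, not the authors' exact implementation.

```python
import torch

def outlier_metrics(layer_activations):
    """Hypothetical sketch: average kurtosis and maximum infinity norm
    over a list of per-layer activation tensors."""
    kurtoses, inf_norms = [], []
    for act in layer_activations:
        x = act.detach().flatten().float()
        mu, sigma = x.mean(), x.std()
        # Pearson kurtosis E[(x - mu)^4] / sigma^4 (assumed convention;
        # Gaussian-distributed activations give a value of 3).
        kurtoses.append(((x - mu) ** 4).mean() / sigma.clamp_min(1e-12) ** 4)
        # Infinity norm: the largest absolute activation value in the layer.
        inf_norms.append(x.abs().max())
    avg_kurtosis = torch.stack(kurtoses).mean().item()
    max_inf_norm = torch.stack(inf_norms).max().item()
    return avg_kurtosis, max_inf_norm
```

Lower values of both statistics indicate fewer extreme activation outliers, which is the property the paper's 22+% and 26+% reductions describe.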
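
The 14/5/5 ratio in the Dataset Splits row corresponds to roughly 58%/21%/21% of the examples. A minimal split sketch, assuming a contiguous index partition (the paper does not specify how the partition is drawn):

```python
def train_val_test_split(n_examples, ratio=(14, 5, 5)):
    """Hypothetical contiguous split of n_examples indices by a 14/5/5 ratio."""
    total = sum(ratio)
    n_train = n_examples * ratio[0] // total
    n_val = n_examples * ratio[1] // total
    train = range(0, n_train)
    val = range(n_train, n_train + n_val)
    test = range(n_train + n_val, n_examples)  # remainder goes to test
    return train, val, test
```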
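
The Experiment Setup row lists the Table 3 hyperparameters for the fast convergence task. Collected here into a single config object for readability; the field names are illustrative, not the authors' actual argument names.

```python
from dataclasses import dataclass

@dataclass
class FastConvergenceConfig:
    """Hypothetical container for the Table 3 hyperparameters."""
    learning_rate: float = 1e-4
    embedding_dim: int = 512
    feedforward_dim: int = 1024
    dropout: float = 0.3
    activation: str = "gelu"
    epochs: int = 100
    batch_size: int = 512
    optimizer: str = "adam"
    patch_size: int = 32

config = FastConvergenceConfig()
```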