Mimetic Initialization of Self-Attention Layers
Authors: Asher Trockman, J. Zico Kolter
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We explore the weights of such pre-trained Transformers (particularly for vision) to attempt to find reasons for this discrepancy. Surprisingly, we find that simply initializing the weights of self-attention layers so that they look more like their pre-trained counterparts allows us to train vanilla Transformers faster and to higher final accuracies, particularly on vision tasks such as CIFAR-10 and ImageNet classification, where we see gains in accuracy of over 5% and 4%, respectively. |
| Researcher Affiliation | Collaboration | ¹Carnegie Mellon University, ²Bosch Center for AI. |
| Pseudocode | No | The paper provides mathematical formulas for its initialization scheme but does not present them in a pseudocode or algorithm block format. |
| Open Source Code | No | The paper does not provide any explicit statements about open-sourcing the code or links to a code repository. |
| Open Datasets | Yes | Our initialization shows strong advantages for ViTs, allowing gains of up to 5% when training on small datasets like CIFAR-10, and up to 4% for larger datasets, i.e., ImageNet-1k within a standard ResNet-style training pipeline. We also see smaller performance gains on language modeling tasks such as WikiText-103. To further show that our initialization is not overfit to CIFAR-10 or ImageNet in particular, we present results for CIFAR-100, SVHN, and Tiny ImageNet using our initialization. |
| Dataset Splits | No | The paper mentions training on various datasets and conducting ablations, but it does not specify the exact training/validation/test splits (e.g., percentages or counts) used for reproducibility. It implies standard splits but doesn't state them. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper describes the training pipeline and parameters, but it does not specify software dependencies with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8'). |
| Experiment Setup | Yes | Setup: We train all ViTs using a simple pipeline: we use RandAugment and Cutout for augmentation, a batch size of 512, AdamW with a 3 × 10⁻³ learning rate, 0.01 weight decay, and 100 epochs. We use a vanilla ViT with embedding dimension 192, depth 12, patch size 2, and input size 32 unless otherwise noted (ViT-Tiny). We use a class token and sinusoidal position embeddings. We use α1 = β1 = 0.7 and α2 = β2 = 0.4 for all experiments. (A configuration sketch follows the table.) |
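
Since the "Experiment Setup" row amounts to a training configuration, a minimal sketch may help make it concrete. The hyperparameter values (ViT-Tiny shape, batch size, AdamW settings, epoch count, and the α/β coefficients) are quoted from the paper; the constant names and the PyTorch optimizer call are illustrative assumptions, not the authors' released code, and the mimetic initialization formula itself is not reproduced here.

```python
# Illustrative sketch of the training setup quoted above (assumed structure,
# not the authors' code). Requires PyTorch.
import torch
from torch.optim import AdamW

# ViT-Tiny configuration from the setup row.
EMBED_DIM = 192      # embedding dimension
DEPTH = 12           # number of Transformer blocks
PATCH_SIZE = 2       # patch size for 32x32 inputs
IMAGE_SIZE = 32
BATCH_SIZE = 512
EPOCHS = 100
LR = 3e-3            # AdamW learning rate
WEIGHT_DECAY = 0.01

# Mimetic-initialization coefficients reported for all experiments
# (the initialization formula is given in the paper, not quoted here).
ALPHA_1 = BETA_1 = 0.7
ALPHA_2 = BETA_2 = 0.4

def build_optimizer(model: torch.nn.Module) -> AdamW:
    """AdamW with the learning rate and weight decay from the setup row."""
    return AdamW(model.parameters(), lr=LR, weight_decay=WEIGHT_DECAY)
```

Augmentation (RandAugment and Cutout) and the ViT model definition would be supplied by whatever vision pipeline is in use; none of that is specified beyond the description quoted in the table.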