Mimetic Initialization of Self-Attention Layers

Authors: Asher Trockman, J. Zico Kolter

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We explore the weights of such pre-trained Transformers (particularly for vision) to attempt to find reasons for this discrepancy. Surprisingly, we find that simply initializing the weights of self-attention layers so that they look more like their pre-trained counterparts allows us to train vanilla Transformers faster and to higher final accuracies, particularly on vision tasks such as CIFAR-10 and ImageNet classification, where we see gains in accuracy of over 5% and 4%, respectively. (A sketch of this initialization appears below the table.)
Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, 2 Bosch Center for AI.
Pseudocode | No | The paper provides mathematical formulas for its initialization scheme but does not present them in a pseudocode or algorithm-block format.
Open Source Code | No | The paper does not provide any explicit statements about open-sourcing the code or links to a code repository.
Open Datasets | Yes | Our initialization shows strong advantages for ViTs, allowing gains of up to 5% when training on small datasets like CIFAR-10, and up to 4% for larger datasets, i.e., ImageNet-1k within a standard ResNet-style training pipeline. We also see smaller performance gains on language modeling tasks such as WikiText-103. To further show that our initialization is not overfit to CIFAR-10 or ImageNet in particular, we present results for CIFAR-100, SVHN, and Tiny ImageNet using our initialization.
Dataset Splits | No | The paper mentions training on various datasets and conducting ablations, but it does not specify the exact training/validation/test splits (e.g., percentages or counts) used for reproducibility. It implies standard splits but does not state them.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, or memory specifications) used for running the experiments.
Software Dependencies | No | The paper describes the training pipeline and parameters, but it does not specify software dependencies with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8').
Experiment Setup | Yes | Setup: We train all ViTs using a simple pipeline: we use RandAugment and Cutout for augmentation, a batch size of 512, AdamW with a 3 × 10⁻³ learning rate, 0.01 weight decay, and 100 epochs. We use a vanilla ViT with embedding dimension 192, depth 12, patch size 2, and input size 32 unless otherwise noted (ViT-Tiny). We use a class token and sinusoidal position embeddings. We use α1 = β1 = 0.7 and α2 = β2 = 0.4 for all experiments.
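
For readability, the training setup quoted in the Experiment Setup row can be restated as a configuration sketch. The dictionary below is not taken from the paper or from released code (none is provided); the structure and key names are our own, and only the values come from the quoted setup.

```python
# Restatement of the quoted ViT training setup as a plain config dict.
# Key names and nesting are illustrative; values are as reported above.
vit_training_config = {
    "model": {
        "architecture": "vanilla ViT-Tiny",
        "embedding_dim": 192,
        "depth": 12,
        "patch_size": 2,
        "input_size": 32,
        "class_token": True,
        "position_embeddings": "sinusoidal",
    },
    "mimetic_init": {"alpha1": 0.7, "beta1": 0.7, "alpha2": 0.4, "beta2": 0.4},
    "augmentation": ["RandAugment", "Cutout"],
    "optimizer": {"name": "AdamW", "learning_rate": 3e-3, "weight_decay": 0.01},
    "batch_size": 512,
    "epochs": 100,
}
```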
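
The Research Type row quotes the paper's central idea: initialize self-attention weights so that their products resemble those of pre-trained Transformers. Since the paper releases no code, the sketch below is only a minimal PyTorch illustration under our own assumptions: that the query-key product is initialized to a noisy scaled identity (roughly α1·Z + β1·I) and the value-projection product to a noisy negative scaled identity (roughly α2·Z - β2·I), each split into two weight matrices via an SVD. The function names, noise scale, sign convention, and single-head treatment are illustrative, not the authors' specification.

```python
import torch

def factor_product(target: torch.Tensor):
    """Split a d x d target matrix M into factors A, B with A @ B.T == M,
    using the SVD M = U diag(S) Vh: A = U diag(sqrt(S)), B = Vh.T diag(sqrt(S))."""
    U, S, Vh = torch.linalg.svd(target)
    root = torch.diag(S.sqrt())
    return U @ root, Vh.T @ root

def mimetic_attention_init(d: int, alpha1: float = 0.7, beta1: float = 0.7,
                           alpha2: float = 0.4, beta2: float = 0.4):
    """Illustrative sketch (not the authors' code): draw Gaussian noise and
    build query/key and value/projection weights whose products mimic the
    noisy (negative) identity structure observed in pre-trained Transformers."""
    eye = torch.eye(d)
    Z1 = torch.randn(d, d) / d ** 0.5  # entrywise N(0, 1/d) noise (assumed scale)
    Z2 = torch.randn(d, d) / d ** 0.5

    # W_q @ W_k.T approximates alpha1 * Z1 + beta1 * I
    W_q, W_k = factor_product(alpha1 * Z1 + beta1 * eye)
    # W_v @ W_o.T approximates alpha2 * Z2 - beta2 * I (assumed sign convention)
    W_v, W_o = factor_product(alpha2 * Z2 - beta2 * eye)
    return W_q, W_k, W_v, W_o

# Example: initialize one attention layer's weights at the ViT-Tiny width
# reported above (embedding dimension 192), using the paper's alpha/beta values.
W_q, W_k, W_v, W_o = mimetic_attention_init(192)
```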