Transformer Fusion with Optimal Transport

Authors: Moritz Imfeld, Jacopo Graldi, Marco Giordano, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The proposed approach is evaluated on both image classification tasks via Vision Transformer and natural language modeling tasks using BERT. Our approach consistently outperforms vanilla fusion, and, after a surprisingly short finetuning, also outperforms the individual converged parent models. ... We evaluate the quality of our approach with two prominent transformer-based architectures: the ViT (Dosovitskiy et al., 2020) and BERT (Devlin et al., 2018). Our focus is to assess the performance and robustness of our proposed fusion techniques in both image and NLP domains.
Researcher Affiliation | Academia | Moritz Imfeld, Jacopo Graldi, Marco Giordano, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh; ETH Zurich, Switzerland; {moimfeld, graldij, mgiordano}@ethz.ch
Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks. It uses diagrams (e.g., Figure 1, Figure 2, Figure 6) to illustrate concepts like TM flow graphs and cross-head alignment, but these are not formatted as pseudocode or algorithms.
Open Source Code | Yes | Code is available at https://github.com/graldij/transformer-fusion.
Open Datasets | Yes | We conducted our experiments on multiple well-known image classification datasets: CIFAR10, CIFAR100, TINY IMAGENET, and IMAGENET-1K. ... We train from scratch multiple BERT models on the masked language modeling (MLM) task over a subset of the Wikipedia dataset, publicly available on the Hugging Face Hub. (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper mentions 'validation loss' and 'validation accuracy' in training curves (e.g., Figures 7, 8, 9) and 'Finetuning curves on the validation set' (Figure 12), implying the use of a validation set. However, it does not explicitly provide specific percentages, sample counts, or a detailed methodology for creating the training, validation, and test splits needed for reproduction.
Hardware Specification | Yes | In Tab. 9 we provide profiling information for our most used ViT configuration. The experiments were run on an RTX 4090.
Software Dependencies | No | We used Hugging Face both for the implementation of the ViT and for retrieving the datasets. ... We use the ViT implementation available on Hugging Face and we train it from scratch... We use the SimpleViT class from vit-pytorch and we train it from scratch... We use the BERT implementation available on Hugging Face together with the pre-trained bert-base-uncased tokenizer. ... The paper does not provide specific version numbers for these software dependencies. (An instantiation sketch for these components follows the table.)
Experiment Setup | Yes | Model Training. First, we train individual models from scratch on each dataset until convergence. ... Table 6: Training details for the ViT models trained on CIFAR and Tiny ImageNet. Optimizer AdamW, Weight decay 5e-5, Learning Rate maximum value of 1e-3, LR Scheduler cosine scheduling, Warmup 0.025% of epochs, Training Epochs (CIFAR) 2500, Batch size (CIFAR) 1024, Gradient accumulation (CIFAR) 2, Random seed 0-4. ... Table 8: Training details for the ViT models trained on ImageNet. ... Table 11: Training details for the BERT models. (A sketch wiring the quoted optimizer and schedule settings together follows the table.)
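The datasets named in the Open Datasets row are all public, and the paper states that the Wikipedia subset comes from the Hugging Face Hub. Below is a minimal loading sketch using the Hugging Face `datasets` library; the dataset identifiers and the Wikipedia snapshot are assumptions, since the report does not quote the exact Hub names or subset used.

```python
# Minimal sketch: pulling the public datasets named in the paper from the
# Hugging Face Hub. Identifiers and the Wikipedia snapshot are assumptions;
# the paper only says a "subset of the Wikipedia dataset" hosted on the Hub.
from datasets import load_dataset

cifar10 = load_dataset("cifar10")        # image classification, 10 classes
cifar100 = load_dataset("cifar100")      # image classification, 100 classes
imagenet = load_dataset("imagenet-1k")   # gated on the Hub; requires accepting the license
# Tiny ImageNet is omitted here because the report does not quote a Hub identifier for it.

# Illustrative Wikipedia snapshot and slice for the BERT MLM experiments.
wiki = load_dataset("wikipedia", "20220301.en", split="train[:1%]")

print(len(cifar10["train"]), len(wiki))
```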
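The Software Dependencies row names the Hugging Face ViT and BERT implementations, the SimpleViT class from vit-pytorch, and the pre-trained bert-base-uncased tokenizer, but no version numbers. The sketch below shows how these components are typically instantiated; all configuration values are placeholders rather than the paper's settings.

```python
# Sketch of instantiating the components named under Software Dependencies.
# All dimensions and sizes are placeholders, not the paper's configuration.
from transformers import ViTConfig, ViTForImageClassification
from transformers import BertConfig, BertForMaskedLM, BertTokenizerFast
from vit_pytorch import SimpleViT

# Hugging Face ViT, trained from scratch (random init, no pre-trained weights).
vit = ViTForImageClassification(ViTConfig(image_size=32, patch_size=4, num_labels=10))

# SimpleViT from the vit-pytorch package, as quoted in the row above.
simple_vit = SimpleViT(
    image_size=224, patch_size=16, num_classes=1000,
    dim=384, depth=12, heads=6, mlp_dim=1536,
)

# Hugging Face BERT for masked language modeling, with the pre-trained
# bert-base-uncased tokenizer mentioned in the paper.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
bert = BertForMaskedLM(BertConfig(vocab_size=tokenizer.vocab_size))
```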
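The Experiment Setup row quotes AdamW with weight decay 5e-5, a peak learning rate of 1e-3 under cosine scheduling with a short warmup, gradient accumulation of 2, and random seeds 0-4. A hedged sketch of wiring those hyperparameters together in PyTorch is given below; the model, data, and step counts are placeholders, and the warmup fraction is one reading of the quoted "0.025% of epochs".

```python
# Sketch of the quoted ViT hyperparameters (Table 6): AdamW, weight decay 5e-5,
# peak LR 1e-3 with cosine scheduling and a short warmup, gradient accumulation 2,
# seeds 0-4. The model, data, and step counts below are placeholders.
import torch
from transformers import get_cosine_schedule_with_warmup

torch.manual_seed(0)  # the paper reports runs over random seeds 0-4

model = torch.nn.Linear(10, 10)  # stand-in for the ViT
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=5e-5)

total_steps = 1_000                                 # placeholder; epochs * steps per epoch
warmup_steps = max(1, int(0.00025 * total_steps))   # one reading of "0.025% of epochs"
scheduler = get_cosine_schedule_with_warmup(optimizer, warmup_steps, total_steps)

accumulation = 2  # quoted gradient accumulation; paper pairs it with batch size 1024
data = [(torch.randn(8, 10), torch.randint(0, 10, (8,))) for _ in range(16)]  # toy batches

for step, (x, y) in enumerate(data):
    loss = torch.nn.functional.cross_entropy(model(x), y)
    (loss / accumulation).backward()
    if (step + 1) % accumulation == 0:
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```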