Transformer Fusion with Optimal Transport
Authors: Moritz Imfeld, Jacopo Graldi, Marco Giordano, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed approach is evaluated on both image classification tasks via Vision Transformer and natural language modeling tasks using BERT. Our approach consistently outperforms vanilla fusion, and, after a surprisingly short finetuning, also outperforms the individual converged parent models. ... We evaluate the quality of our approach with two prominent transformer-based architectures: the ViT (Dosovitskiy et al., 2020) and BERT (Devlin et al., 2018). Our focus is to assess the performance and robustness of our proposed fusion techniques in both image and NLP domains. |
| Researcher Affiliation | Academia | Moritz Imfeld, Jacopo Graldi, Marco Giordano, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh; ETH Zurich, Switzerland; {moimfeld, graldij, mgiordano}@ethz.ch |
| Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks. It uses diagrams (e.g., Figure 1, Figure 2, Figure 6) to illustrate concepts like TM flow graphs and cross-head alignment, but these are not formatted as pseudocode or algorithms. |
| Open Source Code | Yes | Code is available at https://github.com/graldij/transformer-fusion. |
| Open Datasets | Yes | We conducted our experiments on multiple well-known image classification datasets: CIFAR10, CIFAR100, Tiny ImageNet, and ImageNet-1k. ... We train from scratch multiple BERT models on the masked language modeling (MLM) task over a subset of the Wikipedia dataset, publicly available on the Hugging Face Hub. (See the loading sketch after the table.) |
| Dataset Splits | No | The paper mentions 'validation loss' and 'validation accuracy' in training curves (e.g., Figures 7, 8, 9) and 'Finetuning curves on the validation set' (Figure 12), implying the use of a validation set. However, it does not explicitly provide specific percentages, sample counts, or a detailed methodology for creating the training, validation, and test splits needed for reproduction. |
| Hardware Specification | Yes | In Tab. 9 we provide profiling information for our most used ViT configuration. The experiments were run on an RTX 4090. |
| Software Dependencies | No | We used Hugging Face both for the implementation of the ViT and for retrieving the datasets. ... We use the ViT implementation available on Hugging Face and we train it from scratch... We use the SimpleViT class from vit-pytorch and we train it from scratch... We use the BERT implementation available on Hugging Face together with the pre-trained bert-base-uncased tokenizer. ... The paper does not provide specific version numbers for these software dependencies. (The loading sketch after the table illustrates this stack.) |
| Experiment Setup | Yes | Model Training. First, we train individual models from scratch on each dataset until convergence. ... Table 6: Training details for the ViT models trained on CIFAR and Tiny ImageNet. Optimizer AdamW, Weight decay 5e-5, Learning rate maximum value of 1e-3, LR scheduler cosine scheduling, Warmup 0.025% of epochs, Training epochs CIFAR 2500, Batch size CIFAR 1024, Gradient accumulation CIFAR 2, Random seed 0-4. ... Table 8: Training details for the ViT models trained on ImageNet. ... Table 11: Training details for the BERT models. (An optimizer/schedule sketch follows the table.) |
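The Open Datasets and Software Dependencies rows name the stack the paper builds on: the Hugging Face transformers and datasets libraries, the SimpleViT class from vit-pytorch, and the pre-trained bert-base-uncased tokenizer. Below is a minimal, hedged loading sketch of that stack; since the quotes pin neither versions nor exact model sizes or Wikipedia subsets, every numeric hyperparameter and the dataset slice are illustrative assumptions.

```python
# Hedged sketch of the dependency stack quoted above. No versions are given in
# the paper, so none are pinned here; all model dimensions and the Wikipedia
# slice are illustrative assumptions, not the paper's settings.
from transformers import ViTConfig, ViTForImageClassification, BertTokenizerFast
from vit_pytorch import SimpleViT
from datasets import load_dataset

# ViT from the Hugging Face implementation, trained from scratch in the paper
# (config values here are placeholders).
vit = ViTForImageClassification(ViTConfig(image_size=32, patch_size=4, num_labels=100))

# SimpleViT from vit-pytorch, also trained from scratch (dimensions assumed).
simple_vit = SimpleViT(
    image_size=32, patch_size=4, num_classes=100,
    dim=384, depth=6, heads=6, mlp_dim=768,
)

# Pre-trained bert-base-uncased tokenizer, as quoted.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")

# A Wikipedia subset from the Hugging Face Hub; the exact dump and fraction used
# in the paper are not specified in the quote, so this slice is an assumption.
wiki = load_dataset("wikipedia", "20220301.en", split="train[:1%]")
```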
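The Experiment Setup row quotes the ViT training hyperparameters from Table 6 of the paper (AdamW, weight decay 5e-5, peak learning rate 1e-3, cosine schedule with a short warmup). A minimal PyTorch sketch of that optimizer and schedule follows; the warmup and total step counts are left to the caller, and anything not quoted above is an assumption.

```python
# Hedged sketch of the Table 6 optimizer and schedule: AdamW with weight decay
# 5e-5, peak learning rate 1e-3, linear warmup, then cosine decay.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(model: torch.nn.Module,
                                  total_steps: int,
                                  warmup_steps: int):
    optimizer = AdamW(model.parameters(), lr=1e-3, weight_decay=5e-5)

    def lr_lambda(step: int) -> float:
        # Linear warmup to the peak LR, then cosine decay towards zero.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    return optimizer, LambdaLR(optimizer, lr_lambda)
```

The quoted batch size of 1024 with gradient accumulation of 2 would be realized by stepping the optimizer only every second micro-batch; whether 1024 refers to the micro-batch or the effective batch size is not stated in the quote.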