Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Transformer Fusion with Optimal Transport
Authors: Moritz Imfeld, Jacopo Graldi, Marco Giordano, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed approach is evaluated on both image classification tasks via Vision Transformer and natural language modeling tasks using BERT. Our approach consistently outperforms vanilla fusion, and, after a surprisingly short finetuning, also outperforms the individual converged parent models. ... We evaluate the quality of our approach with two prominent transformer-based architectures: the ViT (Dosovitskiy et al., 2020) and BERT (Devlin et al., 2018). Our focus is to assess the performance and robustness of our proposed fusion techniques in both image and NLP domains. |
| Researcher Affiliation | Academia | Moritz Imfeld, Jacopo Graldi, Marco Giordano, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh; ETH Zurich, Switzerland; EMAIL |
| Pseudocode | No | The paper does not contain any explicit pseudocode or algorithm blocks. It uses diagrams (e.g., Figure 1, Figure 2, Figure 6) to illustrate concepts like TM flow graphs and cross-head alignment, but these are not formatted as pseudocode or algorithms. |
| Open Source Code | Yes | Code is available at https://github.com/graldij/transformer-fusion. |
| Open Datasets | Yes | We conducted our experiments on multiple well-known image classification datasets: CIFAR10, CIFAR100, TINY IMAGENET, and IMAGENET-1K. ... We train from scratch multiple BERT models on the masked language modeling (MLM) task over a subset of the Wikipedia dataset, publicly available on the Hugging Face Hub. |
| Dataset Splits | No | The paper mentions 'validation loss' and 'validation accuracy' in training curves (e.g., Figures 7, 8, 9) and 'Finetuning curves on the validation set' (Figure 12), implying the use of a validation set. However, it does not explicitly provide specific percentages, sample counts, or a detailed methodology for creating the training, validation, and test splits needed for reproduction. |
| Hardware Specification | Yes | In Tab. 9 we provide profiling information for our most used ViT configuration. The experiments were run on an RTX 4090. |
| Software Dependencies | No | We used Hugging Face both for the implementation of the ViT and for retrieving the datasets. ... We use the ViT implementation available on Hugging Face and we train it from scratch... We use the SimpleViT class from vit-pytorch and we train it from scratch... We use the BERT implementation available on Hugging Face together with the pre-trained bert-base-uncased tokenizer. ... The paper does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | Model Training. First, we train individual models from scratch on each dataset until convergence. ... Table 6: Training details for the ViT models trained on CIFAR and Tiny ImageNet models. Optimizer: AdamW; Weight decay: 5e-5; Learning rate: maximum value of 1e-3; LR scheduler: cosine scheduling; Warmup: 0.025% epochs of warmup; Training epochs: CIFAR 2500; Batch size: CIFAR 1024; Gradient accumulation: CIFAR 2; Random seed: 0-4. ... Table 8: Training details for the ViT models trained on Imagenet. ... Table 11: Training details for the BERT models. |
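
The learning-rate settings quoted in the Experiment Setup row (maximum value of 1e-3, cosine scheduling, a warmup phase) can be illustrated with a minimal sketch. This is not the authors' code: the function name `lr_at`, the linear shape of the warmup, and the decay to zero at the end of training are assumptions made for illustration only. The warmup length is left as a parameter because the quoted "0.025% epochs of warmup" does not pin down a step count.

```python
import math

# Values quoted in the Experiment Setup row (Table 6 of the paper).
MAX_LR = 1e-3          # "Learning Rate Maximum value of 1e-3"
WEIGHT_DECAY = 5e-5    # "Weight decay 5e-5"
SEEDS = range(0, 5)    # "Random seed 0-4"


def lr_at(step: int, total_steps: int, warmup_steps: int,
          max_lr: float = MAX_LR) -> float:
    """Learning rate at a 0-indexed optimizer step.

    Hypothetical helper. Assumptions not stated in the source:
    the warmup is linear, and the cosine phase decays to 0 at total_steps.
    """
    if total_steps <= 0 or warmup_steps < 0 or step < 0:
        raise ValueError("steps must be non-negative and total_steps positive")
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the cosine phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Illustrative step counts only; they are not taken from the paper.
    for s in (0, 99, 550, 1000):
        print(s, lr_at(s, total_steps=1000, warmup_steps=100))
```

With a framework such as PyTorch, the same shape would typically be obtained by pairing an AdamW optimizer with a warmup-plus-cosine scheduler. The exact scheduler the authors used is not stated in the row above, so it should be checked against the released code.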