Model Fusion via Optimal Transport

Authors: Sidak Pal Singh, Martin Jaggi

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We test our model fusion approach on standard image classification datasets, like CIFAR10 with commonly used convolutional neural networks (CNNs) such as VGG11 [23] and residual networks like ResNet18 [24]; and on MNIST, we use a fully connected network with 3 hidden layers of size 400, 200, 100, which we refer to as MLPNET. As baselines, we mention the performance of prediction ensembling and vanilla averaging, besides that of individual models. (A hedged sketch of the MLPNET architecture follows this table.)
Researcher Affiliation | Academia | Sidak Pal Singh, ETH Zurich, Switzerland, contact@sidakpal.com; Martin Jaggi, EPFL, Switzerland, martin.jaggi@epfl.ch
Pseudocode | Yes | Algorithm 1: Model Fusion (with ψ = {acts, wts} alignment). (A simplified sketch of the weight-based variant follows this table.)
Open Source Code | Yes | The code is available at the following link: https://github.com/sidak/otfusion.
Open Datasets | Yes | We test our model fusion approach on standard image classification datasets, like CIFAR10 with commonly used convolutional neural networks (CNNs) such as VGG11 [23] and residual networks like ResNet18 [24]; and on MNIST, we use a fully connected network with 3 hidden layers of size 400, 200, 100, which we refer to as MLPNET.
Dataset Splits | No | The paper mentions training and test sets and states that "All the performance scores are test accuracies," with full experimental details deferred to an appendix that was not provided. It does not explicitly specify validation dataset splits or percentages.
Hardware Specification | Yes | To give a concrete estimate, the time taken to fuse six VGG11 models is 15 seconds on 1 Nvidia V100 GPU (c.f. Section S1.4 for more details).
Software Dependencies | No | The paper mentions using "exact OT solvers" and optimization methods like SGD, but it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow, or specific library versions) that would be needed for replication.
Experiment Setup | Yes | We typically consider a mini-batch of 100 to 400 samples for these experiments. Both are trained on their portions of the data for 10 epochs, and other training settings are identical. The finetuning scores for vanilla and OT averaging correspond to their best obtained results, when retrained with several finetuning learning rate schedules for a total of 100 and 120 epochs in case of VGG11 and RESNET18 respectively. (These values are gathered in the config sketch after this table.)
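
For reference, the MLPNET described in the Research Type and Open Datasets rows (a fully connected MNIST network with hidden layers of size 400, 200, and 100) can be sketched in PyTorch as below. This is a minimal sketch: the ReLU activations, the flattened 28x28 input, the 10 output classes, and the class name are our assumptions for illustration and are not taken from the authors' otfusion repository.

```python
# Minimal sketch of the MLPNET architecture described in the paper:
# three hidden layers of size 400, 200, and 100 on MNIST.
# Assumptions (not from the paper excerpt): ReLU activations, flattened
# 28x28 input, 10 output classes, and the class name itself.
import torch
import torch.nn as nn

class MLPNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Flatten(),                  # 1x28x28 MNIST image -> 784-dim vector
            nn.Linear(784, 400), nn.ReLU(),
            nn.Linear(400, 200), nn.ReLU(),
            nn.Linear(200, 100), nn.ReLU(),
            nn.Linear(100, num_classes),   # logits over the 10 digit classes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.layers(x)
```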
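The paper's Algorithm 1 aligns the neurons of one model to those of another, layer by layer, via optimal transport (with a ground cost built from either activations or weights) and then averages the aligned weights. The sketch below is a simplified, weight-based ("wts") variant for two fully connected networks of identical architecture: with uniform marginals and equal layer widths, the exact OT plan is a scaled permutation matrix, so SciPy's Hungarian solver stands in for a general OT solver. Function and variable names are illustrative, biases and normalization layers are omitted, and the code is not taken from the authors' repository.

```python
# Simplified weight-based OT fusion of two MLPs with identical architectures.
# Assumptions: uniform marginals and equal layer widths, so the exact OT plan
# is (up to scaling) a permutation and the Hungarian algorithm can replace a
# general OT solver. Biases and normalization layers are ignored for brevity.
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def fuse_two_mlps(weights_a, weights_b):
    """weights_a, weights_b: lists of weight matrices with shape (out_dim, in_dim)."""
    fused = []
    prev_perm = np.eye(weights_b[0].shape[1])    # input features are already aligned
    for l, (wa, wb) in enumerate(zip(weights_a, weights_b)):
        wb_hat = wb @ prev_perm.T                # re-index incoming edges by the previous alignment
        if l < len(weights_a) - 1:               # hidden layer: align B's neurons to A's
            cost = cdist(wa, wb_hat, metric="sqeuclidean")
            rows, cols = linear_sum_assignment(cost)
            perm = np.zeros_like(cost)
            perm[rows, cols] = 1.0               # perm[i, j] = 1 pairs A's neuron i with B's neuron j
            wb_hat = perm @ wb_hat               # permute B's neurons into A's ordering
            prev_perm = perm
        else:
            prev_perm = np.eye(wa.shape[0])      # output units (class logits) already correspond
        fused.append(0.5 * (wa + wb_hat))        # vanilla averaging after alignment
    return fused
```

Fusing more than two models, activation-based costs, non-uniform marginals, and convolutional or residual layers require the additional machinery described in the paper and implemented in the linked repository.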
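For scanning convenience, the hyperparameters quoted in the Experiment Setup row can be collected as below. Key names are ours; values not quoted in that row (optimizer, learning-rate schedules, etc.) are deliberately left out.

```python
# Hyperparameters quoted in the Experiment Setup row, gathered into one
# illustrative dictionary. Key names are ours, not the authors'.
experiment_setup = {
    "minibatch_samples_for_fusion": (100, 400),            # mini-batch size range quoted for the fusion experiments
    "split_training_epochs": 10,                           # each model trained on its data portion for 10 epochs
    "finetune_epochs": {"VGG11": 100, "RESNET18": 120},    # best result over several finetuning LR schedules
}
```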