Layer-wise linear mode connectivity
Authors: Linara Adilova, Maksym Andriushchenko, Michael Kamp, Asja Fischer, Martin Jaggi
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Based on our empirical and theoretical investigation, we introduce a novel notion of the layer-wise linear connectivity, and show that deep networks do not have layer-wise barriers between them. (A layer-wise interpolation sketch is given below the table.) |
| Researcher Affiliation | Academia | Linara Adilova (Ruhr University Bochum, EPFL); Maksym Andriushchenko (EPFL); Michael Kamp (IKIM UK Essen, RUB, and Monash University); Asja Fischer (Ruhr University Bochum); Martin Jaggi (EPFL) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code for the experiments is published at https://github.com/link-er/layer-wise-lmc. |
| Open Datasets | Yes | CIFAR-10 with ResNet18. We trained a small GPT-like model with 12 layers on Wikitext. We train two fully connected networks with 3 hidden layers on MNIST. MobileNet trained on CIFAR-100. DomainNet dataset (Peng et al., 2019). |
| Dataset Splits | No | The paper mentions 'training set' and 'test set' but does not explicitly describe a separate 'validation' set or its specific split details for reproduction. |
| Hardware Specification | No | The paper does not specify the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper refers to training setups and implementations taken from specific GitHub repositories (e.g., 'https://github.com/epfml/llm-baselines', 'https://github.com/jhoon-oh/FedBABU'), but it does not explicitly list software dependencies with version numbers. |
| Experiment Setup | Yes | We train ResNet18 without normalization layers using a warm-up learning rate schedule: starting from 0.0001 and increasing linearly over 100 epochs to reach 0.05. Afterwards, cosine annealing is used to decay the learning rate. Batch size is 64; training runs for 200 epochs with the SGD optimizer, momentum 0.9, and weight decay 5e-4. For VGG11 the training setup is the following: batch size 128, learning rate 0.05, with a step-wise scheduler multiplying the learning rate by 0.5 every 30 steps. Training is performed for 200 epochs with SGD, momentum 0.9, and weight decay 5e-4. The MobileNet implementation and training hyperparameters were taken from https://github.com/jhoon-oh/FedBABU; in particular, we use batch size 128 and a learning rate of 0.1, decayed by a factor of 0.1 at 50% and 75% of training. Training is done for 320 epochs. (A PyTorch sketch of the ResNet18 recipe is given below the table.) |
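
To make the layer-wise connectivity claim quoted in the Research Type row concrete: the idea is to interpolate only one layer's weights between two trained networks while keeping every other layer fixed, and to measure the resulting loss barrier. The sketch below does that for two PyTorch models with identical architectures. It is not the authors' released code (see https://github.com/link-er/layer-wise-lmc for that); the helper names `interpolate_layer`, `avg_loss`, and `layerwise_barrier` and the choice of 11 interpolation points are illustrative assumptions.

```python
import copy
import torch

@torch.no_grad()
def interpolate_layer(model_a, model_b, layer_name, alpha):
    """Copy of model_a whose parameters under `layer_name` are replaced by the
    linear interpolation (1 - alpha) * w_a + alpha * w_b; all other layers keep
    model_a's weights (the layer-wise path, as opposed to full-model LMC)."""
    model = copy.deepcopy(model_a)
    params_b = dict(model_b.named_parameters())
    for name, param in model.named_parameters():
        if name.startswith(layer_name):
            param.copy_((1 - alpha) * param + alpha * params_b[name])
    return model

@torch.no_grad()
def avg_loss(model, loss_fn, loader, device):
    """Average loss of `model` over all batches in `loader`."""
    model.eval().to(device)
    total, n = 0.0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        total += loss_fn(model(x), y).item() * x.size(0)
        n += x.size(0)
    return total / n

def layerwise_barrier(model_a, model_b, layer_name, loss_fn, loader,
                      device="cpu", steps=11):
    """Loss barrier along the layer-wise path: the largest excess of the
    interpolated loss over the linear baseline between the two endpoints."""
    loss_a = avg_loss(model_a, loss_fn, loader, device)
    loss_b = avg_loss(model_b, loss_fn, loader, device)
    barrier = 0.0
    for i in range(steps):
        alpha = i / (steps - 1)
        mixed = interpolate_layer(model_a, model_b, layer_name, alpha)
        loss_alpha = avg_loss(mixed, loss_fn, loader, device)
        baseline = (1 - alpha) * loss_a + alpha * loss_b
        barrier = max(barrier, loss_alpha - baseline)
    return barrier
```

For torchvision's ResNet18, `layer_name` would be a parameter-name prefix such as `"layer3"`; comparing such per-layer barriers against the barrier of the full-model interpolation is the kind of experiment the table above refers to.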
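The ResNet18 recipe quoted in the Experiment Setup row can be turned into a rough PyTorch sketch. It encodes only the numbers given above (linear warm-up from 0.0001 to 0.05 over 100 epochs, cosine annealing afterwards, batch size 64, 200 epochs, SGD with momentum 0.9 and weight decay 5e-4); using torchvision's ResNet18 with `norm_layer=torch.nn.Identity` on CIFAR-10 is a stand-in assumption for "ResNet18 without normalization layers", not the authors' exact architecture or training script.

```python
import torch
from torch.optim import SGD
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

# Numbers quoted for ResNet18 (no normalization layers) on CIFAR-10.
EPOCHS, WARMUP_EPOCHS = 200, 100
PEAK_LR, START_LR = 0.05, 1e-4
BATCH_SIZE = 64

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stand-in for "ResNet18 without normalization layers":
# torchvision's ResNet18 with BatchNorm replaced by identity.
model = models.resnet18(num_classes=10, norm_layer=torch.nn.Identity).to(device)

train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=BATCH_SIZE, shuffle=True)

optimizer = SGD(model.parameters(), lr=PEAK_LR, momentum=0.9, weight_decay=5e-4)

# Linear warm-up from 1e-4 to 0.05 over the first 100 epochs,
# followed by cosine annealing over the remaining epochs.
warmup = LinearLR(optimizer, start_factor=START_LR / PEAK_LR,
                  end_factor=1.0, total_iters=WARMUP_EPOCHS)
cosine = CosineAnnealingLR(optimizer, T_max=EPOCHS - WARMUP_EPOCHS)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                         milestones=[WARMUP_EPOCHS])

criterion = torch.nn.CrossEntropyLoss()
for epoch in range(EPOCHS):
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()
    scheduler.step()  # schedulers are stepped once per epoch in this sketch
```

The same pattern, with the batch sizes, learning rates, and step-wise or multi-step schedulers quoted above, would cover the VGG11 and MobileNet setups.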