Module-wise Training of Neural Networks via the Minimizing Movement Scheme

Authors: Skander Karkar, Ibrahim Ayed, Emmanuel de Bézenac, Patrick Gallinari

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we show improved accuracy of module-wise training of various architectures such as ResNets, Transformers and VGG, when our regularization is added, superior to that of other module-wise training methods and often to end-to-end training, with as much as 60% less memory usage.
Researcher Affiliation | Collaboration | Skander Karkar (Criteo, Sorbonne Université); Ibrahim Ayed (Sorbonne Université, Thales); Emmanuel de Bézenac (ETH Zurich); Patrick Gallinari (Criteo, Sorbonne Université)
Pseudocode | No | The paper describes its methods using mathematical formulations and textual descriptions, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at github.com/block-wise/module-wise and implementation details are in Appendix E.
Open Datasets | Yes | "Tiny ImageNet", "CIFAR100 [37]", "STL10 [12]", "ImageNet", "CIFAR10 [37]", "MNIST [38]"
Dataset Splits | No | The paper mentions training data and test accuracy but does not explicitly provide details about the creation or use of a distinct validation dataset split with specific percentages or counts.
Hardware Specification | Yes | We use NVIDIA Tesla V100 16GB GPUs for the experiments.
Software Dependencies | No | The paper mentions using standard implementations of neural network architectures and optimizers (SGD, AdamW) but does not provide specific version numbers for these software components or any other libraries used.
Experiment Setup | Yes | For sequential and multi-lap sequential training, we use SGD with a learning rate of 0.007. With the exception of the Swin Transformer in Table 4, we use SGD for parallel training with a learning rate of 0.003 in all tables except Table 3, where the learning rate is 0.002. For the Swin Transformer in Table 4, we use the AdamW optimizer with a learning rate of 0.007 and a cosine LR scheduler. For end-to-end training, we use a learning rate of 0.1 that is divided by five at epochs 120, 160 and 200. Momentum is always 0.9. For parallel and end-to-end training, we train for 300 epochs. For the experiments in Section 4.1, we use a batch size of 256, orthogonal initialization [51] with a gain of 0.1, label smoothing of 0.1 and weight decay of 0.0002. The batch size changes to 64 for Table 3 and to 1024 for Table 4.
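
As an illustration of the Experiment Setup row above, the following is a minimal sketch of how the quoted hyperparameters could be assembled in PyTorch. The use of PyTorch itself, the helper name build_training_setup, its mode argument, and the choice of MultiStepLR / CosineAnnealingLR schedulers are assumptions made for illustration; they are not taken from the paper or its repository.

```python
import torch
import torch.nn as nn

def init_orthogonal(module, gain=0.1):
    # Orthogonal initialization with gain 0.1, as quoted in the Experiment Setup row.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.orthogonal_(module.weight, gain=gain)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def build_training_setup(model, mode="parallel"):
    # Hypothetical helper mapping the quoted settings onto PyTorch objects.
    # Weight decay 0.0002 is quoted for the Section 4.1 experiments and is
    # applied uniformly here as a simplification.
    model.apply(init_orthogonal)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    if mode == "sequential":      # sequential / multi-lap sequential module-wise training
        optimizer = torch.optim.SGD(model.parameters(), lr=0.007,
                                    momentum=0.9, weight_decay=2e-4)
        scheduler = None
    elif mode == "parallel":      # parallel module-wise training (LR 0.002 in Table 3)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.003,
                                    momentum=0.9, weight_decay=2e-4)
        scheduler = None
    elif mode == "swin":          # Swin Transformer in Table 4: AdamW + cosine schedule
        optimizer = torch.optim.AdamW(model.parameters(), lr=0.007, weight_decay=2e-4)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
    else:                         # end-to-end baseline: LR 0.1, divided by 5 at 120/160/200
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                    momentum=0.9, weight_decay=2e-4)
        scheduler = torch.optim.lr_scheduler.MultiStepLR(
            optimizer, milestones=[120, 160, 200], gamma=0.2)
    return criterion, optimizer, scheduler
```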
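
For context on the Research Type row, the paper's approach is module-wise (block-wise) training: each module is trained against its own local objective, so activations never need to be stored for a full end-to-end backward pass, which is where the quoted memory savings come from. The sketch below is a generic rendering of that idea with a transport-cost term penalizing how far a module moves its input, in the spirit of the minimizing movement scheme; the class LocalModule, the function module_wise_step, the auxiliary-head design, and the exact form and weight lam of the regularizer are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LocalModule(nn.Module):
    """One trainable block plus an auxiliary classifier used only for its local loss.
    Layer sizes are placeholders for the sketch."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.aux_head = nn.Linear(dim, num_classes)

    def forward(self, x):
        return self.block(x)

def module_wise_step(modules, optimizers, x, y, lam=0.1):
    # One training step: each module minimizes its own classification loss plus an
    # assumed transport-cost term ||T(h) - h||^2 penalizing the displacement of its
    # input (the exact regularizer used in the paper may differ). Inputs are detached
    # between modules, so no end-to-end backpropagation is performed.
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    h = x
    for module, opt in zip(modules, optimizers):
        h_in = h.detach()                    # stop gradients from crossing module boundaries
        h_out = module(h_in)
        loss = criterion(module.aux_head(h_out), y)
        loss = loss + lam * ((h_out - h_in) ** 2).mean()   # transport-cost regularizer
        opt.zero_grad()
        loss.backward()
        opt.step()
        h = h_out
    return h
```

Training on detached inputs is the design choice that enables the memory reduction quoted above: only one module's activation graph has to be kept in memory at a time, and in the parallel variant the modules can even be updated on different devices.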