Module-wise Training of Neural Networks via the Minimizing Movement Scheme

Authors: Skander Karkar, Ibrahim Ayed, Emmanuel de Bézenac, Patrick Gallinari

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we show improved accuracy of module-wise training of various architectures such as ResNets, Transformers and VGG, when our regularization is added, superior to that of other module-wise training methods and often to end-to-end training, with as much as 60% less memory usage.
Researcher Affiliation | Collaboration | Skander Karkar (Criteo, Sorbonne Université); Ibrahim Ayed (Sorbonne Université, Thales); Emmanuel de Bézenac (ETH Zurich); Patrick Gallinari (Criteo, Sorbonne Université)
Pseudocode | No | The paper describes its methods using mathematical formulations and textual descriptions, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at github.com/block-wise/module-wise and implementation details are in Appendix E.
Open Datasets | Yes | "Tiny ImageNet", "CIFAR100 [37]", "STL10 [12]", "ImageNet", "CIFAR10 [37]", "MNIST [38]"
Dataset Splits | No | The paper mentions training data and test accuracy but does not explicitly provide details about the creation or use of a distinct validation dataset split with specific percentages or counts.
Hardware Specification | Yes | We use NVIDIA Tesla V100 16GB GPUs for the experiments.
Software Dependencies | No | The paper mentions using standard implementations of neural network architectures and optimizers (SGD, AdamW) but does not provide specific version numbers for these software components or any other libraries used.
Experiment Setup | Yes | For sequential and multi-lap sequential training, we use SGD with a learning rate of 0.007. With the exception of the Swin Transformer in Table 4, we use SGD for parallel training with a learning rate of 0.003 in all tables except Table 3, where the learning rate is 0.002. For the Swin Transformer in Table 4, we use the AdamW optimizer with a learning rate of 0.007 and a cosine LR scheduler. For end-to-end training, we use a learning rate of 0.1 that is divided by five at epochs 120, 160 and 200. Momentum is always 0.9. For parallel and end-to-end training, we train for 300 epochs. For the experiments in Section 4.1, we use a batch size of 256, orthogonal initialization [51] with a gain of 0.1, label smoothing of 0.1 and weight decay of 0.0002. The batch size changes to 64 for Table 3 and to 1024 for Table 4.
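
As an illustration of the Experiment Setup row above, the following is a minimal sketch of how the quoted hyperparameters could be assembled in PyTorch. The use of PyTorch itself, the helper name build_training_setup, its mode argument, and the choice of MultiStepLR / CosineAnnealingLR schedulers are assumptions made for illustration; they are not taken from the paper or its repository.

```python
import torch
import torch.nn as nn

def init_orthogonal(module, gain=0.1):
    # Orthogonal initialization with gain 0.1, as quoted in the Experiment Setup row.
    if isinstance(module, (nn.Conv2d, nn.Linear)):
        nn.init.orthogonal_(module.weight, gain=gain)
        if module.bias is not None:
            nn.init.zeros_(module.bias)

def build_training_setup(model, mode="parallel"):
    # Hypothetical helper mapping the quoted settings onto PyTorch objects.
    # Weight decay 0.0002 is quoted for the Section 4.1 experiments and is
    # applied uniformly here as a simplification.
    model.apply(init_orthogonal)
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    if mode == "sequential":      # sequential / multi-lap sequential module-wise training
        optimizer = torch.optim.SGD(model.parameters(), lr=0.007,
                                    momentum=0.9, weight_decay=2e-4)
        scheduler = None
    elif mode == "parallel":      # parallel module-wise training (LR 0.002 in Table 3)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.003,
                                    momentum=0.9, weight_decay=2e-4)
        scheduler = None
    elif mode == "swin":          # Swin Transformer in Table 4: AdamW + cosine schedule
        optimizer = torch.optim.AdamW(model.parameters(), lr=0.007, weight_decay=2e-4)
        scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=300)
    else:                         # end-to-end baseline: LR 0.1, divided by 5 at 120/160/200
        optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                    momentum=0.9, weight_decay=2e-4)
        scheduler = torch.optim.lr_scheduler.MultiStepLR(
            optimizer, milestones=[120, 160, 200], gamma=0.2)
    return criterion, optimizer, scheduler
```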
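
For context on the Research Type row, the paper's approach is module-wise (block-wise) training: each module is trained against its own local objective, so activations never need to be stored for a full end-to-end backward pass, which is where the quoted memory savings come from. The sketch below is a generic rendering of that idea with a transport-cost term penalizing how far a module moves its input, in the spirit of the minimizing movement scheme; the class LocalModule, the function module_wise_step, the auxiliary-head design, and the exact form and weight lam of the regularizer are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LocalModule(nn.Module):
    """One trainable block plus an auxiliary classifier used only for its local loss.
    Layer sizes are placeholders for the sketch."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.block = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.aux_head = nn.Linear(dim, num_classes)

    def forward(self, x):
        return self.block(x)

def module_wise_step(modules, optimizers, x, y, lam=0.1):
    # One training step: each module minimizes its own classification loss plus an
    # assumed transport-cost term ||T(h) - h||^2 penalizing the displacement of its
    # input (the exact regularizer used in the paper may differ). Inputs are detached
    # between modules, so no end-to-end backpropagation is performed.
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)
    h = x
    for module, opt in zip(modules, optimizers):
        h_in = h.detach()                    # stop gradients from crossing module boundaries
        h_out = module(h_in)
        loss = criterion(module.aux_head(h_out), y)
        loss = loss + lam * ((h_out - h_in) ** 2).mean()   # transport-cost regularizer
        opt.zero_grad()
        loss.backward()
        opt.step()
        h = h_out
    return h
```

Training on detached inputs is the design choice that enables the memory reduction quoted above: only one module's activation graph has to be kept in memory at a time, and in the parallel variant the modules can even be updated on different devices.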