Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

PETRA: Parallel End-to-end Training with Reversible Architectures

Authors: Stéphane Rivaud, Louis Fournier, Thomas Pumir, Eugene Belilovsky, Michael Eickenberg, Edouard Oyallon

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We develop a custom autograd-like training framework for PETRA, and we demonstrate its effectiveness on CIFAR-10, ImageNet32, and ImageNet, achieving competitive accuracies comparable to backpropagation using ResNet-18, ResNet-34, and ResNet-50 models. (4) We validate the efficacy of PETRA through rigorous testing on benchmark datasets such as CIFAR-10, ImageNet-32, and ImageNet, where it demonstrates robust performance with minimal impact on accuracy.
Researcher Affiliation | Collaboration | 1 ISIR, Sorbonne Université, Paris; 2 Flatiron Institute, New York; 3 Mila, Montréal; 4 Concordia University, Montréal; 5 LISN, Université Paris-Saclay, CNRS, Inria, Orsay; 6 Helm.ai, San Francisco
Pseudocode | Yes | Algorithm 1: Worker perspective for training in parallel with PETRA, on a stage j, assuming initialized parameters θj and time step t, as well as an accumulation factor k > 1.
Open Source Code | Yes | (6) Additionally, we provide a flexible reimplementation of the autograd system in PyTorch, specifically tailored for our experimental setup, which is available at https://github.com/stephane-rivaud/PETRA.
Open Datasets | Yes | We now describe our experimental setup on CIFAR-10 (Krizhevsky, 2009), ImageNet32 (Chrabaszcz et al., 2017), and ImageNet (Deng et al., 2009).
Dataset Splits | Yes | We now describe our experimental setup on CIFAR-10 (Krizhevsky, 2009), ImageNet32 (Chrabaszcz et al., 2017), and ImageNet (Deng et al., 2009). (Figure 4) Validation accuracy of PETRA and backpropagation for various numbers of accumulation steps, for a RevNet18 trained on ImageNet with k ∈ {1, 2, 4, 8, 16, 32}.
Hardware Specification | Yes | Our models can run on a single A100 (80 GB) to easily compare training dynamics, or distributed over 10 or 18 GPUs when training with a RevNet-18, or a RevNet-34 or -50.
Software Dependencies | No | We base our method on PyTorch (Ansel et al., 2024), although we require significant modifications to the Autograd framework in order to manage delayed first-order quantities consistently with PETRA.
Experiment Setup | Yes | Experimental setup. All our experiments use a standard SGD optimizer with a Nesterov momentum factor of 0.9. We train all models for 300 epochs on CIFAR-10 and 90 epochs on ImageNet32 and ImageNet. We apply standard data augmentation, including horizontal flip, random cropping, and standard normalization, but we do not follow the more involved training settings of Wightman et al. (2021), which potentially lead to higher accuracy. We perform a warm-up of 5 epochs where the learning rate linearly increases from 0 to 0.1, following Goyal et al. (2017). Then, the learning rate is decayed by a factor of 0.1 at epochs 30, 60, and 80 for ImageNet32 and ImageNet; it is decayed at epochs 150 and 225 for CIFAR-10. We use a weight decay of 5e-4 for CIFAR-10 and 1e-4 for ImageNet32 and ImageNet. As suggested in Goyal et al. (2017), we do not apply weight decay on the batch-norm learnable parameters and biases of affine and convolutional layers. For our standard backpropagation experiments, we follow the standard practice and use a batch size of 128 on ImageNet32 and CIFAR-10, and 256 on ImageNet. However, we made a few adaptations to train our models with PETRA. As suggested by Zhuang et al. (2020; 2021a), we employ an accumulation factor k and a batch size of 64, which allows us to reduce the effective staleness during training: in this case, k batches of data must be successively processed before updating the parameters of a stage (see Alg. 1). Such gradient accumulation, however, also increases the effective batch size, and we apply the training recipe used in Goyal et al. (2017) to adjust the learning rate; note that we use the average of the accumulated gradients instead of the sum. The base learning rate is thus given by the formula lr = 0.1 × 64k / 256, with k the accumulation factor.
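The quoted learning-rate rule can be sketched in a few lines of Python. This is a hedged illustration of the linear scaling formula lr = 0.1 × 64k / 256 only; the function name and default arguments are assumptions, not part of the authors' released code.

```python
def petra_base_lr(k, micro_batch=64, ref_lr=0.1, ref_batch=256):
    """Linear learning-rate scaling (Goyal et al., 2017) applied to
    PETRA's effective batch size of micro_batch * k, where k is the
    gradient-accumulation factor. Since accumulated gradients are
    averaged rather than summed, only the base learning rate is
    rescaled by the effective-batch ratio."""
    return ref_lr * micro_batch * k / ref_batch

# With k = 4 accumulation steps, the effective batch size is
# 64 * 4 = 256, matching the reference batch, so lr stays at 0.1.
lr = petra_base_lr(4)
```

For k = 1 the effective batch is 64 and the base learning rate drops to 0.025; for k = 32 it rises to 0.8, which is why the warm-up schedule from Goyal et al. (2017) matters at large accumulation factors.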