The Reversible Residual Network: Backpropagation Without Storing Activations

Authors: Aidan N. Gomez, Mengye Ren, Raquel Urtasun, Roger B. Grosse

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of RevNets on CIFAR-10, CIFAR-100, and ImageNet, establishing nearly identical classification accuracy to equally-sized ResNets, even though the activation storage requirements are independent of depth.
Researcher Affiliation | Collaboration | Aidan N. Gomez (1), Mengye Ren (1,2,3), Raquel Urtasun (1,2,3), Roger B. Grosse (1,2); (1) University of Toronto, (2) Vector Institute for Artificial Intelligence, (3) Uber Advanced Technologies Group; {aidan, mren, urtasun, rgrosse}@cs.toronto.edu
Pseudocode | Yes | Algorithm 1: Reversible Residual Block Backprop (a minimal code sketch of the reversible coupling appears after this table).
Open Source Code | Yes | Code available at https://github.com/renmengye/revnet-public
Open Datasets | Yes | We experimented with RevNets on three standard image classification benchmarks: CIFAR-10, CIFAR-100 [17], and ImageNet [26].
Dataset Splits | Yes | We experimented with RevNets on three standard image classification benchmarks: CIFAR-10, CIFAR-100 [17], and ImageNet [26]. Furthermore, Figure 3 compares ImageNet training curves of the ResNet and RevNet architectures; reversibility did not lead to any noticeable per-iteration slowdown in training. (As discussed above, each RevNet update is about 1.5–2× more expensive, depending on the implementation.)
Hardware Specification | Yes | For our ImageNet experiments, we fixed the mini-batch size to be 256, split across 4 Titan X GPUs with data parallelism [28].
Software Dependencies | No | The paper states 'We implemented the RevNets using the TensorFlow library [1]' but does not specify a version number for TensorFlow or any other software dependency.
Experiment Setup | Yes | For our CIFAR-10/100 experiments, we fixed the mini-batch size to be 100. The learning rate was initialized to 0.1 and decayed by a factor of 10 at 40K and 60K training steps, training for a total of 80K steps. The weight decay constant was set to 2 × 10⁻⁴ and the momentum was set to 0.9. For our ImageNet experiments, we fixed the mini-batch size to be 256, split across 4 Titan X GPUs with data parallelism [28]. We employed synchronous SGD [4] with momentum of 0.9. The model was trained for 600K steps, with factor-of-10 learning rate decays scheduled at 160K, 320K, and 480K steps. Weight decay was set to 1 × 10⁻⁴. (A sketch of the CIFAR learning-rate schedule appears after this table.)
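
The reversible block referenced in the Pseudocode row (Algorithm 1) splits its input into two halves and computes y1 = x1 + F(x2), y2 = x2 + G(y1), a coupling that can be inverted exactly so activations do not need to be stored. The sketch below is a minimal, self-contained illustration of that coupling only; the toy residual functions F and G are placeholders of our own, not the paper's residual subnetworks.

```python
import numpy as np

# Placeholder residual functions; in a RevNet these would be small
# residual subnetworks (conv + batch norm + ReLU stacks).
def F(x):
    return np.tanh(x)

def G(x):
    return 0.5 * np.tanh(x)

def forward(x1, x2):
    """Forward pass of one reversible block: (x1, x2) -> (y1, y2)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    """Reconstruct the block's inputs from its outputs, so the inputs
    never have to be kept in memory during the forward pass."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# The block is exactly invertible (up to floating-point error).
x1, x2 = np.random.randn(4), np.random.randn(4)
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```

During backprop, each block's inputs are recomputed from its outputs in exactly this way, which is why the activation storage requirements are independent of depth.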
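
The CIFAR-10/100 schedule described in the Experiment Setup row can be written as a piecewise-constant function of the training step. This is a minimal sketch under the stated hyperparameters; the function name is ours, not from the paper's code.

```python
def cifar_learning_rate(step: int) -> float:
    """Schedule from the CIFAR-10/100 setup: start at 0.1 and decay by a
    factor of 10 at 40K and 60K steps, training for 80K steps in total."""
    if step < 40_000:
        return 0.1
    elif step < 60_000:
        return 0.01
    else:
        return 0.001
```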