The Reversible Residual Network: Backpropagation Without Storing Activations

Authors: Aidan N. Gomez, Mengye Ren, Raquel Urtasun, Roger B. Grosse

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of RevNets on CIFAR-10, CIFAR-100, and ImageNet, establishing nearly identical classification accuracy to equally-sized ResNets, even though the activation storage requirements are independent of depth.
Researcher Affiliation | Collaboration | Aidan N. Gomez (1), Mengye Ren (1,2,3), Raquel Urtasun (1,2,3), Roger B. Grosse (1,2); (1) University of Toronto, (2) Vector Institute for Artificial Intelligence, (3) Uber Advanced Technologies Group; {aidan, mren, urtasun, rgrosse}@cs.toronto.edu
Pseudocode | Yes | Algorithm 1: Reversible Residual Block Backprop (a minimal code sketch of the reversible coupling appears after this table).
Open Source Code | Yes | Code available at https://github.com/renmengye/revnet-public
Open Datasets | Yes | We experimented with RevNets on three standard image classification benchmarks: CIFAR-10, CIFAR-100 [17], and ImageNet [26].
Dataset Splits | Yes | We experimented with RevNets on three standard image classification benchmarks: CIFAR-10, CIFAR-100 [17], and ImageNet [26]. Furthermore, Figure 3 compares ImageNet training curves of the ResNet and RevNet architectures; reversibility did not lead to any noticeable per-iteration slowdown in training. (As discussed above, each RevNet update is about 1.5–2× more expensive, depending on the implementation.)
Hardware Specification | Yes | For our ImageNet experiments, we fixed the mini-batch size to be 256, split across 4 Titan X GPUs with data parallelism [28].
Software Dependencies | No | The paper states 'We implemented the RevNets using the TensorFlow library [1]' but does not specify a version number for TensorFlow or any other software dependency.
Experiment Setup | Yes | For our CIFAR-10/100 experiments, we fixed the mini-batch size to be 100. The learning rate was initialized to 0.1 and decayed by a factor of 10 at 40K and 60K training steps, training for a total of 80K steps. The weight decay constant was set to 2 × 10⁻⁴ and the momentum was set to 0.9. For our ImageNet experiments, we fixed the mini-batch size to be 256, split across 4 Titan X GPUs with data parallelism [28]. We employed synchronous SGD [4] with momentum of 0.9. The model was trained for 600K steps, with factor-of-10 learning rate decays scheduled at 160K, 320K, and 480K steps. Weight decay was set to 1 × 10⁻⁴. (A sketch of the CIFAR learning-rate schedule appears after this table.)
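
The reversible block referenced in the Pseudocode row (Algorithm 1) splits its input into two halves and computes y1 = x1 + F(x2), y2 = x2 + G(y1), a coupling that can be inverted exactly so activations do not need to be stored. The sketch below is a minimal, self-contained illustration of that coupling only; the toy residual functions F and G are placeholders of our own, not the paper's residual subnetworks.

```python
import numpy as np

# Placeholder residual functions; in a RevNet these would be small
# residual subnetworks (conv + batch norm + ReLU stacks).
def F(x):
    return np.tanh(x)

def G(x):
    return 0.5 * np.tanh(x)

def forward(x1, x2):
    """Forward pass of one reversible block: (x1, x2) -> (y1, y2)."""
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    """Reconstruct the block's inputs from its outputs, so the inputs
    never have to be kept in memory during the forward pass."""
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

# The block is exactly invertible (up to floating-point error).
x1, x2 = np.random.randn(4), np.random.randn(4)
y1, y2 = forward(x1, x2)
r1, r2 = inverse(y1, y2)
assert np.allclose(x1, r1) and np.allclose(x2, r2)
```

During backprop, each block's inputs are recomputed from its outputs in exactly this way, which is why the activation storage requirements are independent of depth.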
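
The CIFAR-10/100 schedule described in the Experiment Setup row can be written as a piecewise-constant function of the training step. This is a minimal sketch under the stated hyperparameters; the function name is ours, not from the paper's code.

```python
def cifar_learning_rate(step: int) -> float:
    """Schedule from the CIFAR-10/100 setup: start at 0.1 and decay by a
    factor of 10 at 40K and 60K steps, training for 80K steps in total."""
    if step < 40_000:
        return 0.1
    elif step < 60_000:
        return 0.01
    else:
        return 0.001
```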