The Reversible Residual Network: Backpropagation Without Storing Activations
Authors: Aidan N. Gomez, Mengye Ren, Raquel Urtasun, Roger B. Grosse
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of RevNets on CIFAR-10, CIFAR-100, and ImageNet, establishing nearly identical classification accuracy to equally-sized ResNets, even though the activation storage requirements are independent of depth. |
| Researcher Affiliation | Collaboration | Aidan N. Gomez (1), Mengye Ren (1,2,3), Raquel Urtasun (1,2,3), Roger B. Grosse (1,2); (1) University of Toronto, (2) Vector Institute for Artificial Intelligence, (3) Uber Advanced Technologies Group; {aidan, mren, urtasun, rgrosse}@cs.toronto.edu |
| Pseudocode | Yes | Algorithm 1 Reversible Residual Block Backprop (a minimal sketch of the reversible block follows the table) |
| Open Source Code | Yes | Code available at https://github.com/renmengye/revnet-public |
| Open Datasets | Yes | We experimented with RevNets on three standard image classification benchmarks: CIFAR-10, CIFAR-100 [17], and ImageNet [26]. |
| Dataset Splits | Yes | We experimented with RevNets on three standard image classification benchmarks: CIFAR-10, CIFAR-100 [17], and ImageNet [26]. Furthermore, Figure 3 compares ImageNet training curves of the ResNet and RevNet architectures; reversibility did not lead to any noticeable per-iteration slowdown in training. (As discussed above, each RevNet update is about 1.5-2× more expensive, depending on the implementation.) |
| Hardware Specification | Yes | For our ImageNet experiments, we fixed the mini-batch size to be 256, split across 4 Titan X GPUs with data parallelism [28]. |
| Software Dependencies | No | The paper states 'We implemented the RevNets using the TensorFlow library [1]' but does not specify a version number for TensorFlow or any other software dependency. |
| Experiment Setup | Yes | For our CIFAR-10/100 experiments, we fixed the mini-batch size to be 100. The learning rate was initialized to 0.1 and decayed by a factor of 10 at 40K and 60K training steps, training for a total of 80K steps. The weight decay constant was set to 2 × 10⁻⁴ and the momentum was set to 0.9. For our ImageNet experiments, we fixed the mini-batch size to be 256, split across 4 Titan X GPUs with data parallelism [28]. We employed synchronous SGD [4] with momentum of 0.9. The model was trained for 600K steps, with factor-of-10 learning rate decays scheduled at 160K, 320K, and 480K steps. Weight decay was set to 1 × 10⁻⁴. (A minimal schedule sketch also follows the table.) |
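
The pseudocode row above refers to the paper's Algorithm 1, which backpropagates through a reversible residual block by reconstructing the block's inputs from its outputs rather than storing them. Below is a minimal NumPy sketch of that reconstruction step, not the authors' implementation; the residual functions `F` and `G` are hypothetical stand-ins for the paper's convolutional subnetworks.

```python
# A minimal sketch of the reversible residual block used in RevNets,
# assuming F and G are arbitrary (hypothetical) elementwise callables.
import numpy as np

def forward_block(x1, x2, F, G):
    # Forward pass: y1 = x1 + F(x2), y2 = x2 + G(y1).
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def invert_block(y1, y2, F, G):
    # Recover the inputs from the outputs, so activations need not be
    # stored during the forward pass: x2 = y2 - G(y1), x1 = y1 - F(x2).
    x2 = y2 - G(y1)
    x1 = y1 - F(x2)
    return x1, x2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F, G = np.tanh, np.sin  # toy residual functions for illustration
    x1, x2 = rng.standard_normal(8), rng.standard_normal(8)
    y1, y2 = forward_block(x1, x2, F, G)
    x1_rec, x2_rec = invert_block(y1, y2, F, G)
    # Up to floating-point error, the inputs are reconstructed exactly;
    # this is what lets backprop recompute activations block by block
    # instead of caching them.
    assert np.allclose(x1, x1_rec) and np.allclose(x2, x2_rec)
```

During the backward pass, Algorithm 1 first performs this inversion and then recomputes F(x2) and G(y1) to obtain the gradients, so activation memory stays constant in the network's depth.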
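The CIFAR-10/100 setup quoted in the Experiment Setup row amounts to a piecewise-constant learning-rate schedule plus a fixed set of optimizer constants. The sketch below encodes only those stated values; the function and dictionary names are illustrative and not taken from the released code.

```python
# Hypothetical encoding of the CIFAR-10/100 schedule described in the paper.
def learning_rate(step, base_lr=0.1, decay_steps=(40_000, 60_000), factor=0.1):
    """Piecewise-constant schedule: divide the rate by 10 at 40K and 60K steps."""
    lr = base_lr
    for boundary in decay_steps:
        if step >= boundary:
            lr *= factor
    return lr

CIFAR_CONFIG = {
    "batch_size": 100,      # mini-batch size
    "total_steps": 80_000,  # total training steps
    "momentum": 0.9,        # SGD momentum
    "weight_decay": 2e-4,   # weight decay constant
}
```

The ImageNet runs follow the same pattern with batch size 256, 600K total steps, decays at 160K/320K/480K steps, and weight decay 1e-4.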