Efficient Combination of Rematerialization and Offloading for Training DNNs
Authors: Olivier Beaumont, Lionel Eyraud-Dubois, Alena Shilova
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We measured running times and memory occupation of several networks from the PyTorch torchvision package: resnet, densenet and inception, with a batch size of 16 and images of 500×500 pixels. The simulation results presented here were obtained using 4 cores of a 24-core Haswell Intel® Xeon® E5-2680 v3 at 2.5 GHz, with 128GB of memory, and used about one hour of computation. |
| Researcher Affiliation | Academia | Inria Bordeaux {olivier.beaumont, lionel.eyraud-dubois, alena.shilova}@inria.fr |
| Pseudocode | No | To save space, we will not detail in the main paper all the equations of the dynamic program, which involves a large number of cases. We will focus in the main part of the paper on the intuitions and the general working principle of the dynamic program and refer the reader to Appendix B for detailed derivations and proofs. |
| Open Source Code | Yes | We have implemented a preliminary version of our best performing algorithms (pofo and autocapper) and made them available in rotor [1]. Rotor. https://gitlab.inria.fr/hiepacs/rotor, 2019. |
| Open Datasets | Yes | We measured running times and memory occupation of several networks from the PyTorch torchvision package: resnet, densenet and inception, with a batch size of 16 and images of 500×500 pixels. |
| Dataset Splits | No | The authors marked 'N/A' for specifying all training details including data splits in their reproducibility checklist. |
| Hardware Specification | Yes | Time measurements were performed on an NVIDIA Tesla V100 GPU. We also measured the bandwidth obtained when transferring PyTorch tensors from and to the GPU and obtained 12GB/s. The simulation results presented here were obtained using 4 cores of a 24-core Haswell Intel® Xeon® E5-2680 v3 at 2.5 GHz, with 128GB of memory, and used about one hour of computation. |
| Software Dependencies | Yes | In the recent release of PyTorch 1.10, the introduction of the saved_tensors_hooks() feature makes it possible to implement the offloading technique described in this paper. |
| Experiment Setup | Yes | We measured running times and memory occupation of several networks from the PyTorch torchvision package: resnet, densenet and inception, with a batch size of 16 and images of 500×500 pixels. |
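The `saved_tensors_hooks()` mechanism cited in the Software Dependencies row can be illustrated with a minimal sketch. This is not the authors' implementation (their code lives in rotor); it only shows the general idea under stated assumptions: the hook names `pack_to_cpu`/`unpack_from_cpu` and the choice of `"cpu"` as the offload target are illustrative, and a real offloading schedule would move tensors asynchronously and selectively.

```python
# Minimal sketch of activation offloading via saved_tensors_hooks
# (available since PyTorch 1.10). Assumption: every saved activation is
# moved to host memory eagerly; real schedulers are more selective.
import torch

def pack_to_cpu(tensor):
    # Called when autograd saves an activation: remember the original
    # device and stash the data in host memory.
    return tensor.device, tensor.to("cpu")

def unpack_from_cpu(packed):
    # Called during the backward pass: bring the activation back.
    device, tensor = packed
    return tensor.to(device)

x = torch.randn(8, 8, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack_to_cpu, unpack_from_cpu):
    y = (x * x).sum()  # x is saved for backward through the hooks
y.backward()

# Gradient of sum(x^2) is 2x, so offloading must not change the result.
assert torch.allclose(x.grad, 2 * x)
```

On a CPU-only machine the moves are no-ops, but on a GPU the same two hooks transfer activations over PCIe, which is the 12GB/s bandwidth the paper measures.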