Dynamic Tensor Rematerialization

Authors: Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, Zachary Tatlock

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement a DTR prototype in PyTorch merely by interposing on tensor allocations and operator calls and collecting lightweight metadata on tensors. We demonstrate that a simple online algorithm can achieve comparable performance by introducing Dynamic Tensor Rematerialization (DTR), a greedy online algorithm for checkpointing that is extensible and general, is parameterized by eviction policy, and supports dynamic models. We simulated DTR on a variety of models to empirically evaluate its checkpointing performance across different heuristics and compare it to the static checkpointing schemes examined in Jain et al. (2020).
Researcher Affiliation | Collaboration | Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA; OctoML, Seattle, WA; School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
Pseudocode | Yes | Figure 1: (Top) Pseudocode for DTR's basic logic (independent of heuristic), and (Bottom) DTR's sequence of events in an operator call. Note that PerformOp() may make further recursive calls in order to rematerialize arguments.
Open Source Code | Yes | We implemented a DTR prototype in PyTorch... Publicly available at https://github.com/uwsampl/dtr-prototype
Open Datasets | No | The paper mentions using logs from various models, including InceptionV4, Transformer, ResNet-32, DenseNet-121, LSTM, TreeLSTM, Unrolled GAN, VGG16, and MobileNet. However, it does not explicitly name the datasets these models were trained on or provide concrete access information (link, DOI, or formal citation with authors/year) for any specific dataset.
Dataset Splits | No | The paper discusses evaluating models and their performance characteristics (compute overhead, memory ratio) but does not provide specific train/validation/test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) or describe a cross-validation setup.
Hardware Specification | Yes | All logs were produced by running each model 50 times on a single input on a machine with an NVIDIA Titan V GPU (CUDA 10.1, cuDNN 7.6.4) and a 16-core AMD Ryzen Threadripper 1950X on Ubuntu 18.04, logging the final warmed-up run.
Software Dependencies | Yes | We instrumented PyTorch (Paszke et al., 2019)... All logs were produced by running each model 50 times on a single input on a machine with an NVIDIA Titan V GPU (CUDA 10.1, cuDNN 7.6.4) and a 16-core AMD Ryzen Threadripper 1950X on Ubuntu 18.04, logging the final warmed-up run.
Experiment Setup | Yes | To model a realistic execution setting for DTR, we instrumented PyTorch (Paszke et al., 2019) to log operations performed, metadata on tensors and operators (including sizes, compute times, and parent tensors), and deallocations during the execution of various models. We replayed the logs in a simulator that models the behavior of DTR in the style shown in Figure 1. Model batch sizes are given in parentheses.
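The Pseudocode row refers to Figure 1's heuristic-independent logic: an operator call first rematerializes any evicted arguments (possibly recursively, via PerformOp), evicts resident tensors under the chosen heuristic until the result fits the memory budget, then runs the operator. A minimal Python sketch of that control flow, under stated assumptions: the names `DTRRuntime`, `perform_op`, and the `Tensor` fields are hypothetical, the heuristic is written in the spirit of the paper's cost-over-(size × staleness) ranking, and the real prototype's pinning of in-flight arguments and handling of constants are omitted.

```python
class Tensor:
    """Checkpointable tensor with the lightweight metadata DTR tracks."""
    def __init__(self, op, parents, size, cost):
        self.op = op            # closure that (re)computes this tensor
        self.parents = parents  # argument tensors (for rematerialization)
        self.size = size        # memory footprint
        self.cost = cost        # compute time of op
        self.data = None        # None means "evicted"
        self.last_access = 0.0  # for staleness-based heuristics

class DTRRuntime:
    def __init__(self, budget, heuristic):
        self.budget = budget
        self.heuristic = heuristic  # tensor -> score; lowest score evicted first
        self.pool = []              # evictable resident tensors
        self.clock = 0.0            # simulated compute time

    def memory_used(self):
        return sum(t.size for t in self.pool)

    def evict_until(self, need):
        # Greedy eviction; the real prototype also pins in-flight arguments,
        # which this sketch approximates via near-zero staleness on fresh accesses.
        while self.pool and self.memory_used() + need > self.budget:
            victim = min(self.pool, key=self.heuristic)
            self.pool.remove(victim)
            victim.data = None

    def rematerialize(self, t):
        # Recursively restore evicted parents, then rerun t's operator.
        for p in t.parents:
            if p.data is None:
                self.rematerialize(p)
            p.last_access = self.clock
        self.evict_until(t.size)
        t.data = t.op(*[p.data for p in t.parents])
        self.clock += t.cost
        t.last_access = self.clock
        self.pool.append(t)

    def perform_op(self, op, args, size, cost):
        for a in args:
            if a.data is None:          # may recurse, as Figure 1 notes
                self.rematerialize(a)
            a.last_access = self.clock  # freshly accessed: shielded from eviction
        self.evict_until(size)
        out = Tensor(op, args, size, cost)
        out.data = op(*[a.data for a in args])
        self.clock += cost
        out.last_access = self.clock
        self.pool.append(out)
        return out

# Example: a 2-unit budget forces eviction and later rematerialization.
rt = DTRRuntime(budget=2, heuristic=None)
rt.heuristic = lambda t: t.cost / (t.size * max(rt.clock - t.last_access, 1e-9))
a = rt.perform_op(lambda: 1, [], size=1, cost=1.0)
b = rt.perform_op(lambda x: x + 1, [a], size=1, cost=1.0)
c = rt.perform_op(lambda x: x * 2, [b], size=1, cost=1.0)   # evicts a
d = rt.perform_op(lambda x: x + 10, [a], size=1, cost=1.0)  # rematerializes a
```

Because eviction is a pure policy choice, swapping the `heuristic` lambda is all that "parameterized by eviction policy" requires in this sketch.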
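The Experiment Setup row describes replaying instrumented PyTorch logs through a simulator of Figure 1's behavior. A toy replay loop in that spirit, with hedges: the `LogEntry` schema and `replay` function are hypothetical stand-ins (the real logs also record deallocations and richer metadata), and ties or budget overshoots are handled naively. It reports total simulated compute time and the extra time spent rematerializing, which is the kind of overhead the paper's simulation measures per heuristic.

```python
from collections import namedtuple

# One logged operator call: output id, argument ids, result size, compute cost.
# (Hypothetical schema; the paper's logs also include deallocations.)
LogEntry = namedtuple("LogEntry", "out parents size cost")

def replay(log, budget, heuristic):
    """Replay a logged execution under a memory budget.
    Returns (total compute time, extra time spent rematerializing)."""
    entries, resident = {}, {}
    clock, overhead = 0.0, 0.0

    def mem():
        return sum(e.size for e in resident.values())

    def evict_until(need, pinned=frozenset()):
        # Evict the heuristic's cheapest victim until the new result fits.
        while mem() + need > budget:
            victims = [i for i in resident if i not in pinned]
            if not victims:
                break  # only pinned tensors remain; sketch tolerates overshoot
            worst = min(victims, key=lambda i: heuristic(entries[i], clock))
            del resident[worst]

    def materialize(i, recompute):
        nonlocal clock, overhead
        e = entries[i]
        for p in e.parents:          # recursively restore evicted parents
            if p not in resident:
                materialize(p, recompute=True)
        evict_until(e.size, pinned=frozenset(e.parents))
        clock += e.cost
        if recompute:
            overhead += e.cost       # rematerialization is pure overhead
        resident[i] = e

    for e in log:
        entries[e.out] = e
        materialize(e.out, recompute=False)
    return clock, overhead
```

With a generous budget the replay reports zero overhead; shrinking the budget exposes each heuristic's rematerialization cost, which is how simulated heuristics can be compared against static checkpointing schemes.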