Dynamic Tensor Rematerialization
Authors: Marisa Kirisame, Steven Lyubomirsky, Altan Haan, Jennifer Brennan, Mike He, Jared Roesch, Tianqi Chen, Zachary Tatlock
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement a DTR prototype into PyTorch merely by interposing on tensor allocations and operator calls and collecting lightweight metadata on tensors. We demonstrate that a simple online algorithm can achieve comparable performance by introducing Dynamic Tensor Rematerialization (DTR), a greedy online algorithm for checkpointing that is extensible and general, is parameterized by eviction policy, and supports dynamic models. We simulated DTR on a variety of models to empirically evaluate its checkpointing performance across different heuristics and compare it to the static checkpointing schemes examined in Jain et al. (2020). |
| Researcher Affiliation | Collaboration | Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA; OctoML, Seattle, WA; School of Computer Science, Carnegie Mellon University, Pittsburgh, PA |
| Pseudocode | Yes | Figure 1: (Top) Pseudocode for DTR's basic logic (independent of heuristic), and (Bottom) DTR's sequence of events in an operator call. Note that PerformOp() may make further recursive calls in order to rematerialize arguments. |
| Open Source Code | Yes | We implemented a DTR prototype1 in PyTorch... 1Publicly available at https://github.com/uwsampl/dtr-prototype |
| Open Datasets | No | The paper mentions using logs from various models like Inception V4, Transformer, ResNet-32, DenseNet-121, LSTM, Tree LSTM, Unrolled GAN, VGG16, and MobileNet. However, it does not explicitly name the datasets these models were trained on or provide concrete access information (link, DOI, formal citation with authors/year) for any specific dataset. |
| Dataset Splits | No | The paper discusses evaluating models and their performance characteristics (compute overhead, memory ratio) but does not provide specific train/validation/test dataset splits (e.g., percentages, sample counts, or citations to predefined splits) or describe a cross-validation setup. |
| Hardware Specification | Yes | All logs were produced by running each model 50 times on a single input on a machine with an NVIDIA Titan V GPU (CUDA 10.1, cuDNN 7.6.4) and a 16-core AMD Ryzen Threadripper 1950X on Ubuntu 18.04, logging the final warmed-up run. |
| Software Dependencies | Yes | All logs were produced by running each model 50 times on a single input on a machine with an NVIDIA Titan V GPU (CUDA 10.1, cuDNN 7.6.4) and a 16-core AMD Ryzen Threadripper 1950X on Ubuntu 18.04, logging the final warmed-up run. We instrumented PyTorch (Paszke et al., 2019)... |
| Experiment Setup | Yes | To model a realistic execution setting for DTR, we instrumented PyTorch (Paszke et al., 2019) to log operations performed, metadata on tensors and operators (including sizes, compute times, and parent tensors), and deallocations during the execution of various models. We replayed the logs in a simulator that models the behavior of DTR in the style shown in Figure 1. Model batch sizes are given in parentheses. |
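The quoted evidence describes DTR's core mechanism: a greedy online loop that evicts tensors under memory pressure according to a pluggable heuristic and rematerializes evicted tensors on access by recursively re-running their parent operators. The sketch below is a hypothetical, simplified rendering of that logic (in the spirit of the paper's Figure 1), not the authors' implementation; the class and method names (`DTRRuntime`, `rematerialize`, `evict_until_fits`) are illustrative, and the heuristic shown is a simplified cost/(size x staleness) score rather than the paper's full neighborhood-aware h_DTR.

```python
import time
from dataclasses import dataclass

@dataclass
class Tensor:
    op: callable        # operator that (re)computes this tensor's value
    parents: list       # parent Tensor objects (operator arguments)
    size: int           # memory footprint in abstract units
    cost: float         # measured compute time of op
    value: object = None      # None while evicted
    last_access: float = 0.0  # for the staleness term

class DTRRuntime:
    """Minimal sketch of DTR-style greedy checkpointing (illustrative only)."""

    def __init__(self, budget):
        self.budget = budget   # memory budget
        self.in_memory = 0     # currently resident bytes
        self.pool = []         # evictable, resident tensors

    def heuristic(self, t, now):
        # Simplified stand-in for h_DTR: prefer evicting tensors that are
        # cheap to recompute, large, and stale (lowest score is evicted).
        staleness = now - t.last_access + 1e-9
        return t.cost / (t.size * staleness)

    def evict_until_fits(self, need):
        # Greedily evict the lowest-scoring tensor until the allocation fits.
        while self.in_memory + need > self.budget and self.pool:
            now = time.monotonic()
            victim = min(self.pool, key=lambda t: self.heuristic(t, now))
            self.pool.remove(victim)
            victim.value = None
            self.in_memory -= victim.size

    def rematerialize(self, t):
        # Recursive rematerialization: evicted arguments are recomputed
        # first, mirroring the recursive PerformOp() calls in Figure 1.
        if t.value is not None:
            t.last_access = time.monotonic()
            return t.value
        args = [self.rematerialize(p) for p in t.parents]
        self.evict_until_fits(t.size)
        t.value = t.op(*args)
        self.in_memory += t.size
        self.pool.append(t)
        t.last_access = time.monotonic()
        return t.value

    def compute(self, op, parents, size, cost):
        # Interpose on an operator call: record metadata, then materialize.
        t = Tensor(op=op, parents=parents, size=size, cost=cost)
        self.rematerialize(t)
        return t
```

A toy usage: with a budget of 2 units, computing a three-tensor chain forces one eviction, and touching the evicted tensor again transparently triggers recomputation.

```python
rt = DTRRuntime(budget=2)
a = rt.compute(lambda: 1, [], size=1, cost=0.1)
b = rt.compute(lambda x: x + 1, [a], size=1, cost=0.1)
c = rt.compute(lambda x: x * 2, [b], size=1, cost=0.1)  # forces an eviction
rt.rematerialize(a)  # recomputed on demand if 'a' was the victim
```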