Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.
Cost-Efficient LLM Training with Lifetime-Aware Tensor Offloading via GPUDirect Storage
Authors: Ziqi Yuan, Haoyang Zhang, Yirui Zhou, Apoorve Mohan, I-Hsin Chung, Seetharami Seelam, Jian Huang
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present the design and implementation of a new lifetime-aware tensor offloading framework for GPU memory expansion using low-cost PCIe-based solid-state drives (SSDs). Our framework, TERAIO, is developed explicitly for large language model (LLM) training with multiple GPUs and multiple SSDs. Its design is driven by our observation that the active tensors take only a small fraction (1.7% on average) of allocated GPU memory in each LLM training iteration, the inactive tensors are usually large and will not be used for a long period of time, creating ample opportunities for offloading/prefetching tensors to/from slow SSDs without stalling the GPU training process. TERAIO accurately estimates the lifetime (active period of time in GPU memory) of each tensor with the profiling of the first few iterations in the training process. With the tensor lifetime analysis, TERAIO will generate an optimized tensor offloading/prefetching plan and integrate it into the compiled LLM program via PyTorch. TERAIO has a runtime tensor migration engine to execute the offloading/prefetching plan via GPUDirect storage, which allows direct tensor migration between GPUs and SSDs for alleviating the CPU bottleneck and maximizing the SSD bandwidth utilization. In comparison with state-of-the-art studies such as ZeRO-Offload and ZeRO-Infinity, we show that TERAIO improves the training performance of various LLMs by 1.47× on average, and achieves 80.7% of the ideal performance assuming unlimited GPU memory. |
| Researcher Affiliation | Collaboration | Ziqi Yuan¹, Haoyang Zhang¹, Yirui Eric Zhou¹, Apoorve Mohan², I-Hsin Chung², Seetharami Seelam², Jian Huang¹; ¹University of Illinois Urbana-Champaign, ²IBM Research |
| Pseudocode | Yes | Algorithm 1: Lifetime-Aware Tensor Migration Planning. Require: Set of tensor inactive periods I = {(t_i, s_i, start_i, end_i)}, where t_i is the tensor ID, s_i is the size, and [start_i, end_i] defines the kernel range of inactivity; GPU memory capacity M_GPU; estimation of total amount of required GPU memory M = [m_0, m_1, ..., m_{N-1}] across N kernels; kernel execution times T = [τ_0, τ_1, ..., τ_{N-1}]; I/O bandwidth usage states. Ensure: Migration plan list P = {(t_i, trigger_time, deadline, target)} |
| Open Source Code | Yes | We submit the code in the supplemental materials. |
| Open Datasets | Yes | We use C4 [36] as our training dataset. To study how different memory demands impact the performance of TERAIO, we use batch sizes ranging from 16 to 128 and sequence lengths from 1,024 to 8,192. |
| Dataset Splits | No | The paper mentions using "C4 [36] as our training dataset" but does not explicitly describe how this dataset is split into training, validation, and test sets. It implies that C4 is used for training but does not detail the splits used for evaluation or internal validation. |
| Hardware Specification | Yes | Table 2: Our GPU server configuration. GPU: 2× NVIDIA H100 NVL; GPU Memory: 94GB HBM per GPU; CPU: 2× AMD EPYC 9334; CPU Memory: 1.5TB DDR5 (64GB × 24); Interconnect: PCIe Gen5; SSDs: 8× Samsung 990 PRO 2TB SSD; Read/Write Bandwidth: 6.7/6.5 GB/s per SSD |
| Software Dependencies | Yes | We use PyTorch 2.5.0 [30] and TorchTitan [22] to train LLMs. |
| Experiment Setup | Yes | We use C4 [36] as our training dataset. To study how different memory demands impact the performance of TERAIO, we use batch sizes ranging from 16 to 128 and sequence lengths from 1,024 to 8,192. ... In terms of training precision, we use full-precision training in all experiments. ... We show the performance-critical parameters in the table below [Table 4]. With these settings, we ensure ZeRO-Infinity achieves reasonable performance with our hardware setup. We enabled the pipeline_read/write parameters to optimize computation and data I/O overlap during optimizer state updates. We tuned parameters pin_memory, buffer_count, and buffer_size to optimize tensor offloading throughput. For param_persistence_threshold and model_persistence_threshold, we use their default values. |
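
The inputs and output listed for Algorithm 1 in the Pseudocode row can be illustrated with a small greedy planner. This is a minimal sketch, not the authors' algorithm: the selection heuristic (size times idle time), the assumption that memory is freed for the whole inactive period, the use of kernel indices for the trigger and deadline, and all names (`InactivePeriod`, `plan_migrations`) are hypothetical. The example bandwidth assumes an ideal aggregate of the 8 SSDs at the 6.5 GB/s write rate from the hardware row, and the tensor sizes and kernel times are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InactivePeriod:
    tensor_id: str
    size_gb: float
    start: int  # first kernel index during which the tensor is inactive
    end: int    # last kernel index during which the tensor is inactive


def plan_migrations(periods, mem_required_gb, kernel_times_s,
                    gpu_capacity_gb, bandwidth_gb_s):
    """Greedy sketch: offload tensors until estimated memory fits the GPU.

    Returns (plan, mem), where plan holds tuples
    (tensor_id, trigger_kernel, deadline_kernel, target) and mem is the
    per-kernel memory estimate after the planned offloads.
    """
    mem = list(mem_required_gb)
    plan = []

    def idle_time(p):
        return sum(kernel_times_s[p.start:p.end + 1])

    # Hypothetical heuristic: prefer large tensors that stay idle longest.
    ranked = sorted(periods, key=lambda p: p.size_gb * idle_time(p),
                    reverse=True)
    for p in ranked:
        if max(mem) <= gpu_capacity_gb:
            break  # every kernel already fits in GPU memory
        window = range(p.start, p.end + 1)
        # Offload plus prefetch must fit inside the idle window,
        # otherwise the migration would stall training.
        if 2 * p.size_gb / bandwidth_gb_s > idle_time(p):
            continue
        # Skip tensors whose idle window has no memory pressure.
        if all(mem[k] <= gpu_capacity_gb for k in window):
            continue
        for k in window:
            mem[k] -= p.size_gb  # simplification: freed for the whole window
        plan.append((p.tensor_id, p.start, p.end + 1, "ssd"))
    return plan, mem


# Illustrative inputs only; none of these numbers come from the paper
# except the per-SSD write bandwidth and SSD count.
periods = [
    InactivePeriod("a", 10, 1, 3),
    InactivePeriod("b", 4, 2, 2),
    InactivePeriod("c", 20, 0, 0),
]
plan, mem = plan_migrations(
    periods,
    mem_required_gb=[30, 50, 52, 48, 30],
    kernel_times_s=[1, 1, 1, 1, 1],
    gpu_capacity_gb=45,
    bandwidth_gb_s=8 * 6.5,  # assumed ideal aggregate across 8 SSDs
)
print(plan)
print(mem)
```

Each plan entry mirrors the shape of the tuple (t_i, trigger_time, deadline, target) in the Ensure clause, with the deadline set to the first kernel after the inactive period. A real planner would also track the I/O bandwidth usage states named in the Require clause, which this sketch ignores.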