SpeedLoader: An I/O efficient scheme for heterogeneous and distributed LLM operation

Authors: Yiqi Zhang, Yang You

NeurIPS 2024

Reproducibility assessment. Each entry below gives the variable, the result, and the supporting LLM response.
Research Type: Experimental. Our proposed scheme significantly enhances the training and inference throughput of large language models under restrictive computational resources. We confirmed a large leap in effective compute time by examining the kernel-level runtime behavior of our trials, where MFU can reach up to 51%. Compared to the state-of-the-art approach, our framework robustly achieves remarkable speedups of 3x to 30x in multiple distributed heterogeneous training setups and inference speedups of 1.5x to 2.35x without compromising arithmetic precision. We evaluated SpeedLoader's performance with LLaMA-2 and OPT [17, 18] at different sizes. Results showed that SpeedLoader can robustly achieve a training speedup of 3.5x to 30x and over 50% model FLOPs utilization (MFU) on multiple platforms compared to state-of-the-art approaches.
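As a quick sanity check on the reported ~51% figure, model FLOPs utilization can be estimated with the common 6 * parameters * tokens-per-second approximation for training FLOPs. The sketch below is an illustration only; the throughput number, GPU count, and the 312 TFLOPS dense BF16 peak of an A100 are assumptions, not values taken from the paper.

def estimate_mfu(n_params: float, tokens_per_second: float,
                 peak_tflops_per_gpu: float, n_gpus: int) -> float:
    """Return MFU as a fraction of the aggregate peak FLOP rate."""
    achieved_flops_per_s = 6.0 * n_params * tokens_per_second   # training FLOPs/s
    peak_flops_per_s = peak_tflops_per_gpu * 1e12 * n_gpus      # hardware ceiling
    return achieved_flops_per_s / peak_flops_per_s

# Assumed example: a 7B-parameter model at 60k tokens/s aggregate on
# 16x A100-40GB (312 TFLOPS dense BF16 peak each) lands near 51% MFU.
print(f"{estimate_mfu(7e9, 60_000, 312, 16):.1%}")  # -> 50.5%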
Researcher Affiliation: Academia. Yiqi Zhang, Institute of Data Science, yiqi.zhang@u.nus.edu; Yang You, School of Computing, youy@comp.nus.edu.sg; National University of Singapore, Singapore 119077.
Pseudocode: Yes. Algorithm 1 (pseudo-code for tensor exchange):
x ← embedding(input_ids)
for i = 1 to len(batches) do
    Offload x to pinned_x[lid][i-1]
    x ← buffer
    Register_hook(x)
    Fetch pinned_x[lid-1][i+1] to buffer
    x ← layer(x)
end for
output_logits ← output_embeddings(x)
procedure BACKWARD_HOOK(x, i, lid)
    Offload x.grad to pinned_x[lid-1][i-1]
    x ← act_buffer
    x.grad ← grad_buffer
    Fetch pinned_x[lid-1][x-1] to act_buffer
    Fetch pinned_x[lid][x-1] to grad_buffer
    backward(x)
end procedure
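The forward half of the exchange can be approximated in PyTorch with pinned host buffers and a dedicated copy stream so that offloads and prefetches overlap compute. The sketch below is a minimal illustration under assumed names and sizes (pinned_x, copy_stream, and the Linear stand-in for a transformer block are all illustrative); it is not SpeedLoader's actual implementation.

import torch

device = torch.device("cuda")
copy_stream = torch.cuda.Stream()          # side stream for host<->device copies

n_layers, n_sub_batches = 2, 8             # assumed sizes for illustration
sub_shape = (2, 512, 1024)                 # (sub-batch, sequence, hidden)

# One pinned host slot per (layer block, sub-batch) for activation exchange.
pinned_x = [[torch.empty(sub_shape, pin_memory=True)
             for _ in range(n_sub_batches)] for _ in range(n_layers)]
buffer = torch.empty(sub_shape, device=device)   # on-device staging buffer

layer = torch.nn.Linear(1024, 1024).to(device)   # stand-in for a transformer block
lid = 1                                          # id of the block currently on device

for i in range(n_sub_batches):
    x = torch.randn(sub_shape, device=device)    # stand-in for embedding(input_ids)
    copy_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(copy_stream):
        # Offload the current input and prefetch the next sub-batch's
        # activation from the previous block, off the compute stream.
        pinned_x[lid][i].copy_(x, non_blocking=True)
        if i + 1 < n_sub_batches:
            buffer.copy_(pinned_x[lid - 1][i + 1], non_blocking=True)
    x = layer(x)                                 # compute overlaps the copies
    torch.cuda.current_stream().wait_stream(copy_stream)

The point of the sketch is that the copies are issued on copy_stream with non_blocking=True, so the kernel launched by layer(x) runs concurrently with the DMA transfers; the wait_stream calls only enforce ordering at sub-batch boundaries.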
Open Source Code: Yes. To facilitate reproduction of the experimental results, we provide a public repository. As an example setting, we recommend readers reproduce part of the aforementioned results with A2 VM instances on Google Cloud; this SKU has similar specifications to the computational resources used in this research. Our code repository is publicly available, and users can follow the instructions in Sec. A.6.
Open Datasets: Yes. We evaluated SpeedLoader's performance with LLaMA-2 and OPT [17, 18] at different sizes. We pretrained a 7B and a 13B model following the corresponding configurations of LLaMA-2. The trials ran on the Wikipedia, OpenWebText, and C4 datasets for a cutoff time.
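All three corpora are publicly hosted on the Hugging Face Hub, so one hedged way to obtain them is via the datasets library; the repository identifiers and config names below (wikimedia/wikipedia, Skylion007/openwebtext, allenai/c4) are assumptions about current mirrors, not references taken from the paper.

from datasets import load_dataset

# Stream the corpora so nothing has to be fully downloaded up front.
wiki = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)
owt = load_dataset("Skylion007/openwebtext", split="train", streaming=True)
c4 = load_dataset("allenai/c4", "en", split="train", streaming=True)

for name, ds in [("wikipedia", wiki), ("openwebtext", owt), ("c4", c4)]:
    doc = next(iter(ds))                        # peek at the first document
    print(name, len(doc["text"]), "characters in the first record")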
Dataset Splits: No. The paper does not explicitly provide details about training, validation, and test dataset splits for the experiments.
Hardware Specification: Yes. Our experiments were performed on VMs from an Infrastructure-as-a-Service (IaaS) provider and on computing nodes from a high-performance cluster (HPC). Benchmarks were conducted on VMs with 16 NVIDIA A100-40GB GPUs connected by NVLink; each VM is equipped with 96 vCPU cores and 1360 GB of RAM. Functionality tests were conducted on HPC nodes with NVIDIA A100-40GB GPUs and HPE Slingshot interconnect in a Dragonfly topology. Platform specifications can be found in Tab. 1, which also lists NVIDIA H100-96GB, NVIDIA V100S-32GB, and NVIDIA A6000 GPUs.
Software Dependencies: No. The paper mentions software components such as DeepSpeed ZeRO++, PyTorch, and CUDA streams, but it does not specify exact version numbers for these dependencies.
Experiment Setup: Yes. Hyperparameter selection is a critical aspect of both the training and inference phases of LLMs. Our proposed method expands the search space for hyperparameter tuning, highlighting the importance of a swift tuning strategy. Based on observations in Section 5.1, we developed a one-shot hyperparameter tuning strategy that not only addresses these new dimensions (i.e., sub-batch size, effective batch size, and number of on-device layers) but is also compatible with existing framework-provided tuning tools. The training results are shown in Tab. 2. These trials were conducted on 4x NVIDIA A100 GPUs distributed across four nodes interconnected by Slingshot, with an effective batch size of 512 and a context length of 2048.
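To make the new tuning dimensions concrete, the sketch below models them as a small configuration object and derives how many sub-batches each rank streams per optimizer step. The field names and default values are illustrative assumptions, not SpeedLoader's configuration schema; only the effective batch size (512), context length (2048), and world size (4 GPUs) come from the setup above.

from dataclasses import dataclass

@dataclass
class TuningConfig:
    effective_batch_size: int = 512   # global batch per optimizer step (Tab. 2 trials)
    context_length: int = 2048        # tokens per sequence
    world_size: int = 4               # 4x A100 across four nodes
    sub_batch_size: int = 4           # assumed: sequences per on-device pass
    n_on_device_layers: int = 2       # assumed: transformer blocks resident on GPU

    @property
    def sub_batches_per_step(self) -> int:
        # Sub-batches each rank streams through the on-device block to
        # accumulate one effective batch before the optimizer step.
        return self.effective_batch_size // (self.sub_batch_size * self.world_size)

cfg = TuningConfig()
print(cfg.sub_batches_per_step)  # 32 with the assumed values above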