Cheaply Estimating Inference Efficiency Metrics for Autoregressive Transformer Models
Authors: Deepak Narayanan, Keshav Santhanam, Peter Henderson, Rishi Bommasani, Tony Lee, Percy S. Liang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We use Megatron (NVIDIA), a high-performance GPU implementation of Transformer models with support for autoregressive inference. For a given model, we used the minimum number of GPUs necessary to minimize cost. For example, OpenAI/davinci cannot fit on a single 80-GB A100 GPU; we use tensor model parallelism (Shoeybi et al., 2019) to ensure that the model parameters fit in GPU memory in such cases. Tensor model parallelism works well within a multi-GPU server (Narayanan et al., 2021) since expensive all-to-all communication collectives like all-reduce are limited to fast high-bandwidth NVLink. For even larger models like MS+NV/TNLG v2, we need other forms of parallelism like pipeline model parallelism in order to fit the model in GPU memory without poor scaling. We used NVIDIA HGX servers with 8 NVIDIA A100 SXM4 80GB GPUs; A100 GPUs were the fastest widely available GPU as of October 2022, when we did this work. We evaluate 10 models, ranging in size from 6 to 530 billion parameters. Evaluated models are available in different ways: some were public via a commercial API (e.g., OpenAI/davinci, AI21/J1-Jumbo v1), some were private but the model owner provided research access for this effort (Anthropic/v4-s3, MS+NV/TNLG v2), and some were public and free (e.g., BigScience/BLOOM) and were run using the Together API. We do not evaluate models with withheld model architecture details (e.g., ChatGPT). Table 1 shows the full set of evaluated models, along with the key hyperparameters released by the respective model owner that determine their size. Results. Figure 2 shows the end-to-end runtime measured using the above setup, versus the number of generated output tokens for different prompt sizes and models. We instantiate models based on reported architectures, but with random (untrained) parameters, as we only care about estimating runtime, and runtime is independent of the model's parameters given a prompt size and number of output tokens. We randomly sampled 4 prompt sizes ({1, 512, 1024, 1536}) from the space of all possible prompt sizes and 7 different numbers of output tokens ({1, 2, 4, 8, 16, 32, 64}). Runtime was averaged over 100 prompts of the same size. For each p, we compute a best-fit line using linear regression. We observe that the coefficients of determination (R²) for the resulting time estimates are very close to 1.0 (> 0.999) for all models and conclude that runtime shows a linear relationship with the number of output tokens for each prompt size (i.e., output_generation_time is a linear function of o). (A minimal sketch of this per-prompt-size linear fit appears after the table.) |
| Researcher Affiliation | Collaboration | Deepak Narayanan (NVIDIA) dnarayanan@nvidia.com; Keshav Santhanam (Stanford University) keshav2@cs.stanford.edu; Peter Henderson (Stanford University) phend@cs.stanford.edu; Rishi Bommasani (Stanford University) nlprishi@stanford.edu; Tony Lee (Stanford University) tonyhlee@stanford.edu; Percy Liang (Stanford University) pliang@cs.stanford.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks clearly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | Our code is open sourced at https://github.com/stanford-crfm/helm-efficiency. |
| Open Datasets | Yes | We consider four tasks in HELM: a sentiment analysis task (IMDB), two question answering tasks (MMLU [college chemistry] (Hendrycks et al., 2020) and BoolQ (Clark et al., 2019)), and a classification task (RAFT [terms of service] (Alex et al., 2021)). |
| Dataset Splits | No | The paper mentions using HELM and BIG-Bench, which are evaluation frameworks, but it does not explicitly provide specific train/validation/test dataset split information (percentages, sample counts, or explicit standard split names/citations) within the paper. |
| Hardware Specification | Yes | We used NVIDIA HGX servers with 8 NVIDIA A100 SXM4 80GB GPUs; A100 GPUs were the fastest widely available GPU as of October 2022, when we did this work. |
| Software Dependencies | Yes | CUDA version 11.5.0, Megatron commit hash e156d2f, and fp16 precision. |
| Experiment Setup | Yes | Setup. We use Megatron (NVIDIA), a high-performance GPU implementation of Transformer models with support for autoregressive inference. For a given model, we used the minimum number of GPUs necessary to minimize cost. For example, OpenAI/davinci cannot fit on a single 80-GB A100 GPU; we use tensor model parallelism (Shoeybi et al., 2019) to ensure that the model parameters fit in GPU memory in such cases. Tensor model parallelism works well within a multi-GPU server (Narayanan et al., 2021) since expensive all-to-all communication collectives like all-reduce are limited to fast high-bandwidth NVLink. For even larger models like MS+NV/TNLG v2, we need other forms of parallelism like pipeline model parallelism in order to fit the model in GPU memory without poor scaling. We used NVIDIA HGX servers with 8 NVIDIA A100 SXM4 80GB GPUs; A100 GPUs were the fastest widely available GPU as of October 2022, when we did this work. ... We randomly sampled 4 prompt sizes ({1, 512, 1024, 1536}) from the space of all possible prompt sizes and 7 different numbers of output tokens ({1, 2, 4, 8, 16, 32, 64}). Runtime was averaged over 100 prompts of the same size. (A back-of-the-envelope sketch of the memory constraint that motivates tensor model parallelism appears after the table.) |
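
The setup rows above note that OpenAI/davinci cannot fit on a single 80-GB A100, which is what forces tensor model parallelism. As a minimal sketch of that constraint (assuming fp16 weights only, no KV cache or activations, and the commonly reported 175-billion-parameter count for davinci; none of this is the paper's own sizing procedure), the following snippet estimates the minimum number of 80-GB GPUs needed just to hold the weights:

```python
import math

BYTES_PER_PARAM_FP16 = 2   # fp16 precision, matching the reported Megatron setup
GPU_MEMORY_GB = 80         # NVIDIA A100 SXM4 80GB

def min_gpus_for_weights(num_params: float) -> int:
    """Minimum number of GPUs whose combined memory can hold the fp16 weights alone."""
    weight_gb = num_params * BYTES_PER_PARAM_FP16 / 1e9
    return math.ceil(weight_gb / GPU_MEMORY_GB)

# Assumed 175B parameters for OpenAI/davinci: ~350 GB of weights,
# i.e. at least 5 of the 80-GB A100s before counting the KV cache.
print(min_gpus_for_weights(175e9))
```

Weights alone overflow a single GPU several times over, which is why the model is sharded across the GPUs of one HGX server via tensor model parallelism, and why still larger models such as MS+NV/TNLG v2 additionally need pipeline model parallelism.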
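The core estimation recipe quoted above is a per-prompt-size linear fit of runtime against the number of output tokens o. The sketch below follows that recipe under stated assumptions: `measure_runtime(p, o)` is a hypothetical stand-in for timing a Megatron inference call (it is not part of the released code), while the prompt sizes, output-token counts, 100-trial averaging, and R² check mirror the quoted setup.

```python
import numpy as np

PROMPT_SIZES = [1, 512, 1024, 1536]             # sampled prompt sizes p
OUTPUT_TOKEN_COUNTS = [1, 2, 4, 8, 16, 32, 64]  # sampled numbers of output tokens o

def fit_runtime_model(measure_runtime, num_trials=100):
    """For each prompt size p, fit runtime ~ a_p + b_p * o and report R^2.

    measure_runtime(p, o) is assumed to return the end-to-end runtime (seconds)
    of generating o output tokens from a prompt of p tokens.
    """
    fits = {}
    for p in PROMPT_SIZES:
        o = np.array(OUTPUT_TOKEN_COUNTS, dtype=float)
        # Average each (p, o) measurement over num_trials prompts of the same size.
        t = np.array([
            np.mean([measure_runtime(p, n) for _ in range(num_trials)])
            for n in OUTPUT_TOKEN_COUNTS
        ])
        slope, intercept = np.polyfit(o, t, deg=1)
        residuals = t - (intercept + slope * o)
        r_squared = 1.0 - residuals.var() / t.var()  # OLS residuals have zero mean
        fits[p] = {"intercept": intercept, "slope": slope, "r_squared": r_squared}
    return fits
```

Given such a fit, output_generation_time for a prompt of size p is estimated as intercept + slope * o; the paper's reported R² > 0.999 across models is what justifies substituting the fitted line for repeated measurement.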