$S^3$: Increasing GPU Utilization during Generative Inference for Higher Throughput

Authors: Yunho Jin, Chun-Feng Wu, David Brooks, Gu-Yeon Wei

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate S3, assessing both its throughput and cost-efficiency. Our analysis includes both offline and online scenarios. In online scenarios under the average reading speed latency SLO constraint, we find that S3 can generate up to 6.49× more sequences while adhering to the same SLO constraint. In offline scenarios, we observe that S3 achieves a speedup of up to 6.49× for different models.
Researcher Affiliation | Academia | Yunho Jin (Harvard University), Chun-Feng Wu (National Yang Ming Chiao Tung University), David Brooks (Harvard University), Gu-Yeon Wei (Harvard University)
Pseudocode | No | No pseudocode or algorithm blocks are present.
Open Source Code | No | No explicit statement or link to open-source code for the S3 implementation.
Open Datasets | Yes | Specifically, we fine-tune the model on the Alpaca dataset [18], one of the representative question-and-answering datasets, and use the questions as inputs and the lengths of the answers as labels. We also evaluate the predictor on a model fine-tuned with the Google Natural Questions dataset [19] and observe an accuracy of 77.13%. For completeness, we fine-tune a model on the Pile dataset [20], a non-question-and-answering dataset, and see 65.6% accuracy.
Dataset Splits | No | The paper mentions using Alpaca, Google-NQ, and The Pile datasets for evaluation and fine-tuning, but does not specify the train/validation/test splits (e.g., percentages, counts, or references to predefined splits).
Hardware Specification | Yes | We run our evaluation on an NVIDIA 80GB A100 GPU connected to the host DRAM via PCIe 4.0 ×8 in a Lenovo ThinkSystem SD650-N V2 server [27].
Software Dependencies | No | We implement the systems on top of FasterTransformer [4] since this library is faster than HuggingFace Transformers [3] due to more optimizations.
Experiment Setup | No | Each bucket is allocated a range of (max sequence length / number of buckets), and we use 10 buckets. We fine-tune a DistilBERT [17] model that was trained for sequence classification to classify which length bucket the output sequence length falls into.
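
The Open Datasets and Experiment Setup rows together describe the core of the paper's output-length predictor: questions are the inputs, answer lengths are mapped into 10 equal-width buckets over the model's maximum sequence length, and a DistilBERT sequence-classification model is fine-tuned to predict the bucket. The following is a minimal sketch of that recipe, not the authors' code; the Hugging Face dataset id `tatsu-lab/alpaca`, the 2048-token maximum sequence length, and all training hyperparameters are assumptions not stated in the quoted excerpts.

```python
# Minimal sketch of the bucketed output-length predictor described above.
# Assumptions (not from the paper): dataset id, MAX_LEN, and all hyperparameters.
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer,
)

MAX_LEN = 2048          # assumed maximum sequence length of the target LLM
NUM_BUCKETS = 10        # the paper uses 10 buckets
BUCKET_WIDTH = MAX_LEN // NUM_BUCKETS

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=NUM_BUCKETS
)

def to_example(row):
    # The question (instruction plus optional context) is the predictor's input;
    # the label is the bucket that the answer's token length falls into.
    question = row["instruction"] + (" " + row["input"] if row["input"] else "")
    answer_len = len(tokenizer(row["output"])["input_ids"])
    bucket = min(answer_len // BUCKET_WIDTH, NUM_BUCKETS - 1)
    enc = tokenizer(question, truncation=True, max_length=512)
    enc["label"] = bucket
    return enc

raw = load_dataset("tatsu-lab/alpaca", split="train")   # assumed dataset id
train_ds = raw.map(to_example, remove_columns=raw.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="length-predictor",
        num_train_epochs=3,                 # assumed
        per_device_train_batch_size=32,     # assumed
    ),
    train_dataset=train_ds,
    tokenizer=tokenizer,
)
trainer.train()
```

Equal-width buckets keep length prediction a standard 10-way classification task, which is why an off-the-shelf sequence-classification head on DistilBERT suffices in this sketch.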