Distributed Inference and Fine-tuning of Large Language Models Over The Internet
Authors: Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin A. Raffel
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of our system in simulated conditions and a real-world setup spanning two continents. |
| Researcher Affiliation | Collaboration | Alexander Borzunov (HSE University, Yandex); Max Ryabinin (HSE University, Yandex); Artem Chumachenko (Neiro.ai); Dmitry Baranchuk (Yandex); Tim Dettmers (University of Washington); Younes Belkada (Hugging Face); Pavel Samygin (Yandex School of Data Analysis); Colin Raffel (Hugging Face) |
| Pseudocode | Yes | Algorithm 1: Generating sequence (client-side code); Algorithm 2: rpc_inference (server); Algorithm 3: replace_failed_server(...); a hedged sketch of this client-side loop follows the table. |
| Open Source Code | Yes | PETALS source code and documentation are available at https://petals.dev; a minimal client usage example follows the table. |
| Open Datasets | Yes | We evaluate our system on more practical tasks of running Llama 2 (70B) (Touvron et al., 2023b) and BLOOM (176B) (BigScience, 2022a). |
| Dataset Splits | No | The paper describes the models and tasks used for evaluation but does not specify dataset splits (e.g., train/validation/test percentages or counts) for reproduction. |
| Hardware Specification | Yes | Each pipeline stage is served by a single GeForce 1080 Ti GPU; the four GPUs are running in a single system with dual Xeon Gold 6148 CPUs and 12 DDR4 LRDIMM sticks of 64 GB each. We measure performance for (a) Llama 2 distributed across 3 servers with a T4 GPU each, (b) BLOOM distributed across 3 servers with an A100 (80 GB) GPU each, and (c) BLOOM distributed across 10 servers with an RTX 3090 GPU each. Finally, we benchmark BLOOM in a real-world setup with 14 smaller servers holding 2 RTX 3060, 4 RTX 2080 Ti, 2 RTX 3090, 2 A4000, and 4 A5000 GPUs. |
| Software Dependencies | No | The paper mentions "DeepSpeed with default recommended parameters" and "DeepSpeed v0.7.7" but does not provide version numbers for other core software components such as the programming language (e.g., Python) or other libraries (e.g., PyTorch). |
| Experiment Setup | Yes | We compare our algorithm with baselines when generating a single sequence of length 512. All runs use four pipeline stages with (8, 7, 8, 7) model layers per pipeline stage. We use 4-bit NormalFloat quantization (Dettmers et al., 2023) for Llama 2 and 8-bit matrix decomposition (Dettmers et al., 2022a) for BLOOM in all evaluations; an illustration of these schemes follows the table. We try (a) both prompt tuning and prefix tuning (involving deep prompts), (b) two batch sizes (8 and 32), and (c) two prompt lengths (16 and 4). |
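
The pseudocode row above names a client-side generation loop (Algorithm 1) and a server-replacement routine (Algorithm 3). The sketch below illustrates the general idea under stated assumptions; all names (`ServerFailure`, `find_replacement`, the `client`/`servers` objects) are hypothetical stand-ins, not the actual PETALS API.

```python
# Minimal, hypothetical sketch of fault-tolerant client-side generation
# (cf. Algorithm 1) with server replacement (cf. Algorithm 3).

class ServerFailure(Exception):
    """Raised when a remote server drops out mid-inference (illustrative)."""

def find_replacement(failed_server):
    """Hypothetical: locate another server hosting the same model blocks."""
    raise NotImplementedError

def generate(client, servers, token, max_new_tokens):
    # past_inputs[i] caches every hidden state already sent to server i, so a
    # replacement server can rebuild its attention cache by replaying them.
    past_inputs = [[] for _ in servers]
    outputs = []
    for _ in range(max_new_tokens):
        hidden = client.embed(token)  # embeddings are held by the client
        for i in range(len(servers)):
            while True:
                try:
                    new_hidden = servers[i].forward(hidden)  # one rpc_inference step
                    past_inputs[i].append(hidden)
                    hidden = new_hidden
                    break
                except ServerFailure:
                    servers[i] = find_replacement(servers[i])
                    servers[i].replay(past_inputs[i])  # rebuild attention cache
        token = client.sample(client.lm_head(hidden))  # LM head is client-side
        outputs.append(token)
    return outputs
```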
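
Since the code is public at https://petals.dev, reproduction can start from the standard PETALS client. The following minimal example is adapted from the project's public documentation; the model name and generation length are illustrative choices, not the paper's exact configuration.

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Model name is illustrative; any model served by the public swarm works.
model_name = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Run distributed inference: transformer blocks execute on remote servers,
# while embeddings and the LM head run locally on the client.
inputs = tokenizer("A quick test prompt", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```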
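
The two quantization schemes named in the experiment setup correspond to 4-bit NormalFloat (NF4; Dettmers et al., 2023) and 8-bit matrix decomposition (LLM.int8(); Dettmers et al., 2022a). As a hedged illustration only, this is how the same schemes are enabled through Hugging Face transformers and bitsandbytes; it is not the paper's server-side code, and the model name is a placeholder.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NormalFloat quantization, as used for Llama 2 in the paper.
nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

# 8-bit matrix decomposition (LLM.int8()), as used for BLOOM in the paper.
int8_config = BitsAndBytesConfig(load_in_8bit=True)

# Placeholder model; loading a 70B checkpoint requires substantial GPU memory.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", quantization_config=nf4_config
)
```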