Distributed Inference and Fine-tuning of Large Language Models Over The Internet
Authors: Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, Colin A. Raffel
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of our system in simulated conditions and a real-world setup spanning two continents. |
| Researcher Affiliation | Collaboration | Alexander Borzunov (HSE University, Yandex); Max Ryabinin (HSE University, Yandex); Artem Chumachenko (Neiro.ai); Dmitry Baranchuk (Yandex); Tim Dettmers (University of Washington); Younes Belkada (Hugging Face); Pavel Samygin (Yandex School of Data Analysis); Colin Raffel (Hugging Face) |
| Pseudocode | Yes | Algorithm 1: Generating sequence (client-side code); Algorithm 2: rpc_inference (server); Algorithm 3: replace_failed_server(...); a hedged sketch of this client-side loop follows the table. |
| Open Source Code | Yes | PETALS source code and documentation are available at https://petals.dev; a minimal client usage example follows the table. |
| Open Datasets | Yes | We evaluate our system on more practical tasks of running Llama 2 (70B) (Touvron et al., 2023b) and BLOOM (176B) (BigScience, 2022a). |
| Dataset Splits | No | The paper describes the models and tasks used for evaluation but does not specify dataset splits (e.g., train/validation/test percentages or counts) for reproduction. |
| Hardware Specification | Yes | Each pipeline stage is served by a single GeForce 1080 Ti GPU; the four GPUs are running in a single system with dual Xeon Gold 6148 CPUs and 12 DDR4 LRDIMM sticks of 64 GB each. We measure performance for (a) Llama 2 distributed across 3 servers with a T4 GPU each, (b) BLOOM distributed across 3 servers with an A100 (80 GB) GPU each, and (c) BLOOM distributed across 10 servers with an RTX 3090 GPU each. Finally, we benchmark BLOOM in a real-world setup with 14 smaller servers holding 2 RTX 3060, 4 RTX 2080 Ti, 2 RTX 3090, 2 A4000, and 4 A5000 GPUs. |
| Software Dependencies | No | The paper mentions "DeepSpeed with default recommended parameters" and "DeepSpeed v0.7.7" but does not provide version numbers for other core software components such as the programming language (e.g., Python) or other libraries (e.g., PyTorch). |
| Experiment Setup | Yes | We compare our algorithm with baselines when generating a single sequence of length 512. All runs use four pipeline stages with (8, 7, 8, 7) model layers per pipeline stage. We use 4-bit NormalFloat quantization (Dettmers et al., 2023) for Llama 2 and 8-bit matrix decomposition (Dettmers et al., 2022a) for BLOOM in all evaluations; an illustration of these schemes follows the table. We try (a) both prompt tuning and prefix tuning (involving deep prompts), (b) two batch sizes (8 and 32), and (c) two prompt lengths (16 and 4). |
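
The pseudocode row above names a client-side generation loop (Algorithm 1) and a server-replacement routine (Algorithm 3). The sketch below illustrates the general idea under stated assumptions; all names (`ServerFailure`, `find_replacement`, the `client`/`servers` objects) are hypothetical stand-ins, not the actual PETALS API.

```python
# Minimal, hypothetical sketch of fault-tolerant client-side generation
# (cf. Algorithm 1) with server replacement (cf. Algorithm 3).

class ServerFailure(Exception):
    """Raised when a remote server drops out mid-inference (illustrative)."""

def find_replacement(failed_server):
    """Hypothetical: locate another server hosting the same model blocks."""
    raise NotImplementedError

def generate(client, servers, token, max_new_tokens):
    # past_inputs[i] caches every hidden state already sent to server i, so a
    # replacement server can rebuild its attention cache by replaying them.
    past_inputs = [[] for _ in servers]
    outputs = []
    for _ in range(max_new_tokens):
        hidden = client.embed(token)  # embeddings are held by the client
        for i in range(len(servers)):
            while True:
                try:
                    new_hidden = servers[i].forward(hidden)  # one rpc_inference step
                    past_inputs[i].append(hidden)
                    hidden = new_hidden
                    break
                except ServerFailure:
                    servers[i] = find_replacement(servers[i])
                    servers[i].replay(past_inputs[i])  # rebuild attention cache
        token = client.sample(client.lm_head(hidden))  # LM head is client-side
        outputs.append(token)
    return outputs
```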
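
Since the code is public at https://petals.dev, reproduction can start from the standard PETALS client. The following minimal example is adapted from the project's public documentation; the model name and generation length are illustrative choices, not the paper's exact configuration.

```python
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

# Model name is illustrative; any model served by the public swarm works.
model_name = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

# Run distributed inference: transformer blocks execute on remote servers,
# while embeddings and the LM head run locally on the client.
inputs = tokenizer("A quick test prompt", return_tensors="pt")["input_ids"]
outputs = model.generate(inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```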
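
The two quantization schemes named in the experiment setup correspond to 4-bit NormalFloat (NF4; Dettmers et al., 2023) and 8-bit matrix decomposition (LLM.int8(); Dettmers et al., 2022a). As a hedged illustration only, this is how the same schemes are enabled through Hugging Face transformers and bitsandbytes; it is not the paper's server-side code, and the model name is a placeholder.

```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NormalFloat quantization, as used for Llama 2 in the paper.
nf4_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")

# 8-bit matrix decomposition (LLM.int8()), as used for BLOOM in the paper.
int8_config = BitsAndBytesConfig(load_in_8bit=True)

# Placeholder model; loading a 70B checkpoint requires substantial GPU memory.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", quantization_config=nf4_config
)
```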