Integrated Hardware Architecture and Device Placement Search
Authors: Irene Wang, Jakub Tarnawski, Amar Phanishayee, Divya Mahajan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Our approach achieves higher throughput on large language models compared to the state-of-the-art TPUv4 and the Spotlight accelerator search framework." and "We evaluate PHAZE, the architecture search and solver, on a diverse set of large language models deployed in distributed training environments." |
| Researcher Affiliation | Collaboration | ¹Georgia Institute of Technology, GA, USA; ²Microsoft Research, WA, USA. |
| Pseudocode | Yes | Algorithm 1: PHAZE workflow algorithm (a hedged workflow sketch follows the table) |
| Open Source Code | Yes | The entire source code of PHAZE is available at https://github.com/msr-fiddle/phaze. |
| Open Datasets | Yes | We obtain OPT (Zhang et al., 2022b), BERT-large (Devlin et al., 2019), GPT2 (Radford et al., 2019), and Llama2-7B (Touvron et al., 2023) from the Hugging Face library (Wolf et al., 2019), and TMP graphs and hyper-parameters from the public source code of Megatron-LM (Nvidia, b; Shoeybi et al., 2020). (a loading sketch follows the table) |
| Dataset Splits | No | The paper refers to training and evaluation but does not explicitly provide percentages or absolute counts for dataset splits like train/validation/test, nor does it refer to standard dataset splits for reproduction. |
| Hardware Specification | Yes | PHAZE is executed on a V100 GPU and a Dual AMD Epyc 7713 CPU at 2.0 GHz with 128 cores, running Ubuntu 20.04. The GPU runs CUDA 12.1 and is only used to extract the operator graphs. |
| Software Dependencies | Yes | The overall PHAZE process is executed using Python 3.8. The ILP formulations are solved using Gurobi 10.0.1 (Gurobi Optimization, 2019). The dynamic programming algorithm is implemented in C++, compiled with g++ version 11.3.0 and the -O3 optimization flag. The GPU runs CUDA 12.1... (a version-check sketch follows the table) |
| Experiment Setup | Yes | "Table 1: Architecture and training search parameters explored in PHAZE for per-device execution. ... Microbatch size (mbs): 1 to 8, powers of 2; Activation recomputation: True/False." and "PHAZE is optimized over 1024 accelerators and a global batch size of 4096." (a search-space sketch follows the table) |
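The Pseudocode row refers to the paper's Algorithm 1. As rough orientation only, here is a minimal Python sketch of a generic joint architecture-and-placement search loop of this kind. It is not the paper's Algorithm 1: the function names (`solve_placement`, `estimate_throughput`, `search`), their signatures, and the placeholder bodies are all our assumptions; the real pruning, ILP, and dynamic-programming details live in the paper and the PHAZE repository.

```python
import random

# Hypothetical stand-ins for PHAZE's actual components; these names and
# signatures are assumptions for illustration, not the paper's API.
def solve_placement(model, arch, num_accelerators):
    """Placeholder for the paper's ILP + dynamic-programming placement solver."""
    return {"tmp_degree": arch["cores"],
            "pipeline_stages": num_accelerators // arch["cores"]}

def estimate_throughput(model, arch, placement, global_batch_size):
    """Placeholder cost model; the real one uses per-operator latency estimates."""
    return random.random()

def search(models, arch_candidates, num_accelerators=1024, global_batch_size=4096):
    """Score (architecture, placement) pairs and keep the best per model."""
    best = {}
    for model in models:
        best_throughput = float("-inf")
        for arch in arch_candidates:  # e.g. varying core counts, buffer sizes
            placement = solve_placement(model, arch, num_accelerators)
            throughput = estimate_throughput(model, arch, placement,
                                             global_batch_size)
            if throughput > best_throughput:
                best_throughput = throughput
                best[model] = (arch, placement)
    return best

print(search(models=["gpt2"], arch_candidates=[{"cores": 64}, {"cores": 128}]))
```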
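The models named in the Open Datasets row are all distributed through the Hugging Face hub. Below is a minimal sketch of fetching them with the `transformers` library; the hub identifiers are the standard public checkpoints, which may differ from the exact variants the paper traced.

```python
from transformers import AutoConfig

# Standard Hugging Face hub identifiers for the models named in the quote.
# The exact checkpoints/sizes PHAZE traced may differ (our assumption).
MODEL_IDS = [
    "facebook/opt-350m",         # OPT (Zhang et al., 2022b); other sizes exist
    "bert-large-uncased",        # BERT-large (Devlin et al., 2019)
    "gpt2",                      # GPT2 (Radford et al., 2019)
    "meta-llama/Llama-2-7b-hf",  # Llama2-7B (Touvron et al., 2023); gated repo
]

for model_id in MODEL_IDS:
    try:
        config = AutoConfig.from_pretrained(model_id)  # config only, no weights
        print(model_id, "->", config.model_type)
    except OSError as err:  # gated models need an approved access token
        print(model_id, "-> not accessible:", err)
```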
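Since the Software Dependencies row pins Python 3.8, Gurobi 10.0.1, g++ 11.3.0, and CUDA 12.1, a small check script can confirm a matching environment before attempting a rerun. This is our own convenience sketch, not part of the PHAZE repository, and it assumes `gurobipy`, `nvcc`, and `g++` are installed.

```python
import subprocess
import sys

from gurobipy import GRB  # requires gurobipy and a Gurobi license

# Versions reported in the paper's software-dependency description.
assert sys.version_info[:2] == (3, 8), \
    f"expected Python 3.8, got {sys.version_info[:2]}"
assert (GRB.VERSION_MAJOR, GRB.VERSION_MINOR, GRB.VERSION_TECHNICAL) == (10, 0, 1), \
    "expected Gurobi 10.0.1"

# CUDA toolkit (12.1) and g++ (11.3.0) versions, as reported by their CLIs.
print(subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout)
print(subprocess.run(["g++", "--version"], capture_output=True, text=True).stdout)
```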
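The Experiment Setup row quotes the per-device search parameters and the cluster-level settings. For illustration, here is a plausible encoding of that search space as a Python config; only the values come from the paper, while the dict layout and field names are our own.

```python
from itertools import product

# Values quoted from the paper; the dict layout is an illustrative assumption,
# not PHAZE's actual configuration format.
SEARCH_SPACE = {
    "microbatch_size": [1, 2, 4, 8],            # "mbs 1 to 8, powers of 2"
    "activation_recomputation": [True, False],  # per Table 1
}
CLUSTER = {
    "num_accelerators": 1024,   # "optimized over 1024 accelerators"
    "global_batch_size": 4096,  # "a global batch size of 4096"
}

# Enumerate the per-device training configurations the search would explore.
for mbs, recompute in product(SEARCH_SPACE["microbatch_size"],
                              SEARCH_SPACE["activation_recomputation"]):
    print(f"mbs={mbs}, activation_recomputation={recompute}")
```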