Integrated Hardware Architecture and Device Placement Search

Authors: Irene Wang, Jakub Tarnawski, Amar Phanishayee, Divya Mahajan

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our approach achieves higher throughput on large language models compared to the state-of-the-art TPUv4 and the Spotlight accelerator search framework." and "We evaluate PHAZE, the architecture search and solver, on a diverse set of large language models deployed in distributed training environments."
Researcher Affiliation | Collaboration | "¹Georgia Institute of Technology, GA, USA; ²Microsoft Research, WA, USA."
Pseudocode | Yes | "Algorithm 1: PHAZE workflow algorithm"
Open Source Code | Yes | "The entire source code of PHAZE is available at https://github.com/msr-fiddle/phaze."
Open Datasets | Yes | "We obtain OPT (Zhang et al., 2022b), BERT-large (Devlin et al., 2019), GPT2 (Radford et al., 2019), and Llama2-7B (Touvron et al., 2023) from the Hugging Face library (Wolf et al., 2019) and TMP graphs and hyper-parameters from public source code of Megatron-LM (Nvidia, b; Shoeybi et al., 2020)." (an illustrative model-loading sketch follows the table)
Dataset Splits | No | The paper refers to training and evaluation but does not explicitly provide percentages or absolute counts for dataset splits (train/validation/test), nor does it refer to standard dataset splits for reproduction.
Hardware Specification | Yes | "PHAZE is executed on a V100 GPU and a Dual AMD Epyc 7713 CPU at 2.0 GHz with 128 cores, running Ubuntu 20.04. The GPU runs CUDA 12.1 and is only used to extract the operator graphs."
Software Dependencies | Yes | "The overall PHAZE process is executed using Python 3.8. The ILP formulations are solved using Gurobi 10.0.1 (Gurobi Optimization, 2019). The dynamic programming algorithm is implemented in C++, compiled with g++ version 11.3.0 and the -O3 optimization flag. The GPU runs CUDA 12.1..." (a trivial Gurobi check is sketched below the table)
Experiment Setup | Yes | "Table 1: Architecture and training search parameters explored in PHAZE for per-device execution. ... Microbatch Size (mbs): 1 to 8, powers of 2; Activation Recomputation: True/False." and "PHAZE is optimized over 1024 accelerators and a global batch size of 4096." (the per-device search space is sketched below the table)
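
The Open Datasets row lists the models the paper obtains from the Hugging Face library. The sketch below is a minimal, hypothetical illustration, not code from the PHAZE repository: the hub model IDs are assumptions inferred from the citations, and PHAZE's actual operator-graph extraction happens separately on the V100 GPU.

```python
# Hypothetical illustration, not PHAZE repository code: fetch the
# configurations of the models named in the paper from the Hugging Face hub.
# The model IDs below are assumptions based on the citations.
from transformers import AutoConfig

MODEL_IDS = [
    "facebook/opt-350m",         # OPT (Zhang et al., 2022b); size variant assumed
    "bert-large-uncased",        # BERT-large (Devlin et al., 2019)
    "gpt2",                      # GPT2 (Radford et al., 2019)
    "meta-llama/Llama-2-7b-hf",  # Llama2-7B (Touvron et al., 2023); gated, needs hub access
]

for model_id in MODEL_IDS:
    config = AutoConfig.from_pretrained(model_id)
    print(f"{model_id}: {config.model_type}, {config.num_hidden_layers} layers")
```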
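The Software Dependencies row names Gurobi 10.0.1 as the ILP solver. The following trivial binary program is illustrative only: it confirms that gurobipy is installed and can solve a model, and is unrelated to PHAZE's actual device-placement formulation.

```python
# Illustrative only: a trivial ILP solved with gurobipy to confirm the
# Gurobi dependency is usable; not PHAZE's placement formulation.
import gurobipy as gp
from gurobipy import GRB

model = gp.Model("dependency_check")
x = model.addVar(vtype=GRB.BINARY, name="x")
y = model.addVar(vtype=GRB.BINARY, name="y")
model.setObjective(2 * x + 3 * y, GRB.MAXIMIZE)
model.addConstr(x + y <= 1, name="choose_at_most_one")
model.optimize()
print("status:", model.Status, "objective:", model.ObjVal)
```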
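The Experiment Setup row quotes the per-device search parameters from Table 1. The sketch below is a minimal enumeration of that grid, assuming the microbatch sizes are the powers of two from 1 to 8 and activation recomputation is a boolean flag; the accelerator count and global batch size constants come from the quoted setup. It is not PHAZE's search code.

```python
# Minimal sketch, not PHAZE's search code: enumerate the per-device training
# parameters quoted from Table 1.
from itertools import product

MICROBATCH_SIZES = (1, 2, 4, 8)           # mbs: 1 to 8, powers of 2
ACTIVATION_RECOMPUTATION = (False, True)  # recompute activations during backprop
NUM_ACCELERATORS = 1024                   # devices PHAZE optimizes over
GLOBAL_BATCH_SIZE = 4096                  # fixed global batch size

for mbs, recompute in product(MICROBATCH_SIZES, ACTIVATION_RECOMPUTATION):
    # In PHAZE each candidate would be scored by the placement solver;
    # here we only list the points of the per-device search space.
    print(f"mbs={mbs}, activation_recompute={recompute}, "
          f"accelerators={NUM_ACCELERATORS}, global_batch_size={GLOBAL_BATCH_SIZE}")
```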