Scalable Distributed DL Training: Batching Communication and Computation

Authors: Shaoqi Wang, Aidi Pi, Xiaobo Zhou

AAAI 2019, pp. 5289-5296 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement iBatch in the open-source DL framework BigDL and perform evaluations with various DL workloads. Experimental results show that iBatch improves the scalability of a cluster of 72 nodes by up to 73% over the default PS and 41% over the layer-by-layer strategy.
Researcher Affiliation | Academia | Shaoqi Wang, Aidi Pi, Xiaobo Zhou, Department of Computer Science, University of Colorado, Colorado Springs, CO, USA. {swang, epi, xzhou}@uccs.edu
Pseudocode | Yes | Algorithm 1: Greedy algorithm that generates l_i from l_1 to l_{N-1}
Open Source Code | No | "We have implemented iBatch in BigDL (version 0.5.0) by modifying source files in package com.intel.analytics.bigdl." The paper states that BigDL is open-source but does not explicitly state that the iBatch implementation itself is released, nor does it provide a link to the modified source files.
Open Datasets | Yes | We use two well-known image classification datasets: (1) ImageNet22K, the largest public dataset for image classification, including 14.2 million labeled images from 21,841 categories; (2) ILSVRC12, a subset of ImageNet22K that has 1.28 million training images.
Dataset Splits | No | The paper mentions using datasets for training but does not provide specific details on how the data was split into training, validation, and test sets (e.g., percentages or counts).
Hardware Specification | Yes | We conduct our experiments on a CPU cluster in a private cloud. The cloud runs on 8 HP BL460c G6 blade servers interconnected with 10Gbps Ethernet.
Software Dependencies | Yes | We have implemented iBatch in BigDL (version 0.5.0) by modifying source files in package com.intel.analytics.bigdl.
Experiment Setup | Yes | The goal of iBatch is to minimize the execution time, which comprises the total parameter communication time and the forward computation time. We first formulate the batching decision as an optimization problem of execution-time minimization, based on profiles of the parameter communication time and the forward computation time. Then we use a greedy algorithm that maximizes the overlap to solve the problem and derive the communication and computation batches.
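
The table above quotes the paper's description of Algorithm 1 (a greedy pass that generates batch boundaries l_1 through l_{N-1}) without reproducing it. As a rough illustration only, here is a minimal Python sketch of a greedy, overlap-maximizing batching pass, assuming profiled per-layer communication times comm[i] and forward-computation times comp[i]. The function name, the overlap-budget rule, and the example numbers are all assumptions made for illustration; they are not the authors' actual Algorithm 1.

```python
# Hedged sketch of greedy communication/computation batching in the
# spirit of iBatch (Wang, Pi, Zhou, AAAI 2019). The grouping rule below
# is an illustrative assumption, not the paper's exact algorithm.

def greedy_batches(comm, comp):
    """Group layers 0..N-1 into communication batches.

    comm[i]: profiled parameter-communication time of layer i
    comp[i]: profiled forward-computation time of layer i
    Returns a list of batches (each a list of layer indices), chosen so
    that each batch's communication is hidden, as far as possible,
    behind the forward computation of the preceding batch.
    """
    n = len(comm)
    batches = [[0]]            # layer 0's parameters must be fetched up front
    pending_comm = 0.0         # communication accumulated in the current batch
    overlap_budget = comp[0]   # computation time available to hide it behind
    for i in range(1, n):
        if pending_comm + comm[i] <= overlap_budget:
            # Layer i's parameters can still be fetched while the
            # previous batch computes: extend the current batch.
            batches[-1].append(i)
            pending_comm += comm[i]
        else:
            # Overlap budget exhausted: close the batch, start a new one,
            # and recompute the budget from the batch just closed.
            batches.append([i])
            pending_comm = comm[i]
            overlap_budget = sum(comp[j] for j in batches[-2])
    return batches

# Example with made-up per-layer profiles (seconds)
comm = [0.05, 0.02, 0.01, 0.08, 0.03]
comp = [0.10, 0.04, 0.03, 0.09, 0.05]
print(greedy_batches(comm, comp))  # [[0, 1, 2], [3, 4]]
```

Under these assumptions, the example groups layers 0-2 into one batch because the communication of layers 1 and 2 (0.03s) fits within layer 0's computation time (0.10s), which mirrors the overlap goal described in the Experiment Setup row.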