Sequential Gradient Coding For Straggler Mitigation
Authors: Nikhil Krishnan Muralee Krishnan, MohammadReza Ebrahimi, Ashish J Khisti
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we study a practical setting of concurrently training multiple neural networks over an AWS Lambda cluster involving 256 worker nodes, where our framework naturally applies. We demonstrate that the latter scheme can yield a 16% improvement in runtime over the baseline GC scheme, in the presence of naturally occurring, non-simulated stragglers. |
| Researcher Affiliation | Academia | M. Nikhil Krishnan Indian Institute of Technology Palakkad nikhilkrishnan.m@gmail.com M. Reza Ebrahimi University of Toronto mr.ebrahimi@mail.utoronto.ca Ashish Khisti University of Toronto akhisti@ece.utoronto.ca |
| Pseudocode | Yes | Algorithm 1: Algorithm used by master to assign tasks in round-t; Algorithm 2: Algorithm used by master to assign mini-tasks in round-t. (A hedged sketch of a generic gradient-coding assignment and recovery round appears after this table.) |
| Open Source Code | Yes | We use AWS Serverless Application Model (SAM) tool to define, manage, and deploy the cloud resources (included in the code submitted as supplementary material). |
| Open Datasets | Yes | In each experiment, we run a total of J = 480 jobs (120 jobs per classifier) using the three schemes, namely GC, SR-SGC and M-SGC. As a baseline, we also train the classifiers without any coding wherein the master node should wait for all the workers to return their task results. Finally, each experiment is repeated 10 times to report the first and second-order statistics of total run times. Before training the models, we perform some shorter experiments to choose the best-performing parameters for each of the three coding schemes. Specifically, for GC, we perform a grid search over s and select the value corresponding to the shortest run time. We refer readers to Appendix J for a detailed discussion on the procedure of selecting the parameters for SR-SGC and M-SGC schemes, as well as analysis of sensitivity to parameters. ... (Appendix L: Training ResNet-18 on CIFAR-100) ... We used n = 256 Lambda workers to train M = 4 models concurrently for 1000 rounds (250 rounds for each classifier). |
| Dataset Splits | No | The paper mentions training and testing but does not explicitly describe a separate validation dataset split with specific percentages or counts. It refers to selecting parameters but not a validation split for model performance. |
| Hardware Specification | Yes | Our experiment setup consists of a master node and n = 256 workers. In Fig. 1, we demonstrate statistics of response time across 100 rounds, where each worker calculates gradients for a batch of 16 MNIST images on a CNN involving three convolutional layers, followed by two fully connected layers. ... We use AWS Lambda, a fully-managed and cost-efficient serverless cloud computing service. Workers are invoked from the master node using HTTP requests, and task results are received in the HTTP response payload. ... Each Lambda instance has 2500 MB of RAM and 1 vCPU. (A hedged sketch of this HTTP invocation pattern appears after the table.) |
| Software Dependencies | No | The paper mentions tools such as AWS Lambda and the AWS SAM deployment tool, but does not specify version numbers or a complete list of software dependencies. |
| Experiment Setup | Yes | Our experiment setup consists of a master node and n = 256 workers. ... for the sake of consistency, we choose µ = 1 for all experiments. ... We train M = 4 CNN classifiers for MNIST concurrently ... In every round, master samples a batch of 4096 data points and distributes them among the workers. ... We use cross entropy as the loss function and ADAM as the optimizer. ... In each experiment, we run a total of J = 480 jobs (120 jobs per classifier) using the three schemes, namely GC, SR-SGC and M-SGC. ... We used n = 256 Lambda workers to train M = 4 models concurrently for 1000 rounds (250 rounds for each classifier). A batch size of 512 samples and the ADAM optimizer are used. (A hedged sketch of such a training round appears after the table.) |
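
Algorithms 1 and 2 themselves are only given in the paper and its supplementary code. As a rough, hedged illustration of the kind of master-side logic they describe, the sketch below implements a classic fractional-repetition gradient-coding round, not the paper's SGC/M-SGC schemes; the function names (`assign_tasks`, `recover_gradient`) and the per-group replication layout are assumptions made purely for illustration.

```python
import numpy as np

def assign_tasks(n, s):
    """Fractional-repetition assignment: the n workers are split into
    n // (s + 1) groups, and every worker in a group is given the same
    s + 1 data partitions, so any single responder per group suffices."""
    assert n % (s + 1) == 0, "n must be divisible by s + 1 in this layout"
    group_size = s + 1
    return {w: list(range((w // group_size) * group_size,
                          (w // group_size + 1) * group_size))
            for w in range(n)}

def recover_gradient(n, s, returned):
    """`returned` maps worker id -> the sum of that worker's partition
    gradients (NumPy arrays).  The full-batch gradient is the sum of one
    such partial per group; if some group has no responder yet, return None."""
    group_size = s + 1
    total = None
    for g in range(n // group_size):
        members = [w for w in range(g * group_size, (g + 1) * group_size)
                   if w in returned]
        if not members:
            return None  # this group is still entirely straggling
        partial = returned[members[0]]
        total = partial if total is None else total + partial
    return total

# Toy usage: n = 8 workers, tolerating s = 1 straggler per group.
if __name__ == "__main__":
    n, s, dim = 8, 1, 4
    partition_grads = [np.random.randn(dim) for _ in range(n)]
    tasks = assign_tasks(n, s)
    # Simulate responses from the even-indexed workers only (one per group).
    returned = {w: sum(partition_grads[p] for p in tasks[w])
                for w in range(0, n, 2)}
    full = recover_gradient(n, s, returned)
    assert np.allclose(full, sum(partition_grads))
```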
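
The hardware row states that workers are invoked via HTTP requests and return task results in the response payload. The following is a minimal sketch of that pattern, assuming a Lambda function URL, a JSON request/response format, and base64-encoded gradients; none of these specifics are confirmed by the paper, whose actual endpoints and payloads are defined by its AWS SAM templates.

```python
import base64
import concurrent.futures as cf

import numpy as np
import requests  # assumed HTTP client; the paper's master code may use something else

# Placeholder endpoint -- the real function URLs/routes come from the AWS SAM
# templates shipped with the paper's supplementary material.
WORKER_URL = "https://<lambda-function-url>.lambda-url.us-east-1.on.aws/"

def invoke_worker(worker_id, partitions, timeout=30.0):
    """POST one worker's task description and decode the gradient returned
    in the HTTP response payload (assumed JSON with a base64 field)."""
    resp = requests.post(WORKER_URL,
                         json={"worker_id": worker_id, "partitions": partitions},
                         timeout=timeout)
    resp.raise_for_status()
    grad = np.frombuffer(base64.b64decode(resp.json()["gradient"]), dtype=np.float32)
    return worker_id, grad

def invoke_round(assignment, wait_for):
    """Fire all requests concurrently and stop collecting once `wait_for`
    workers have answered; the rest are treated as stragglers."""
    returned = {}
    pool = cf.ThreadPoolExecutor(max_workers=len(assignment))
    futures = [pool.submit(invoke_worker, w, task) for w, task in assignment.items()]
    for fut in cf.as_completed(futures):
        try:
            worker_id, grad = fut.result()
        except requests.RequestException:
            continue  # a failed or timed-out worker simply counts as a straggler
        returned[worker_id] = grad
        if len(returned) >= wait_for:
            break
    pool.shutdown(wait=False, cancel_futures=True)  # don't block on stragglers (Python 3.9+)
    return returned
```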
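
For the experiment-setup row, the sketch below shows, under stated assumptions, how a master might combine returned per-worker gradients and take one Adam step with a cross-entropy objective. The toy model, `local_gradient`, and `apply_round` are illustrative placeholders rather than the paper's training code, and straggler coding/decoding is omitted.

```python
import torch
from torch import nn

# Toy stand-in for the paper's CNN classifiers (the real architectures are in
# the supplementary code); loss and optimizer match the reported setup.
model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
optimizer = torch.optim.Adam(model.parameters())
loss_fn = nn.CrossEntropyLoss()

N_WORKERS = 256      # n = 256 Lambda workers
ROUND_BATCH = 4096   # samples the master draws each round (16 per worker)

def local_gradient(x, y):
    """What one worker conceptually computes on its 4096 / 256 = 16 samples:
    the gradient of the mean cross-entropy loss w.r.t. the model parameters."""
    model.zero_grad()
    loss_fn(model(x), y).backward()
    return [p.grad.detach().clone() for p in model.parameters()]

def apply_round(worker_grads):
    """Master-side update: average the per-worker gradients that came back
    and take one Adam step."""
    averaged = [torch.stack(per_param).mean(dim=0) for per_param in zip(*worker_grads)]
    optimizer.zero_grad()
    for p, g in zip(model.parameters(), averaged):
        p.grad = g
    optimizer.step()

# Example round with random MNIST-shaped data, split evenly across workers.
x = torch.randn(ROUND_BATCH, 1, 28, 28)
y = torch.randint(0, 10, (ROUND_BATCH,))
grads = [local_gradient(xw, yw)
         for xw, yw in zip(x.chunk(N_WORKERS), y.chunk(N_WORKERS))]
apply_round(grads)
```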