Sketching for Distributed Deep Learning: A Sharper Analysis

Authors: Mayank Shrivastava, Berivan Isik, Qiaobo Li, Sanmi Koyejo, Arindam Banerjee

NeurIPS 2024

Reproducibility Variables (Result and LLM Response for each)
Research Type: Experimental
LLM Response: We present empirical results both on the loss Hessian and overall accuracy of sketch-DL supporting our theoretical results. Taken together, our results provide theoretical justification for the observed empirical success of sketch-DL. In this section, we provide a comparison of the sketching approach in Algorithm 1 with other common approaches such as local Top-r [44] and FetchSGD [59]. ... We train ResNet-18 [28] on the CIFAR-10 dataset [38] that is i.i.d. distributed to 100 clients. Each client performs 5 local gradient descent iterations (i.e., using a full batch of size 500) at every round. Figure 1 shows that the Count-Sketch-based distributed learning approach in Algorithm 1 performs competitively with FetchSGD.
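To make the compression primitive behind this comparison concrete, the following is a minimal NumPy sketch of Count Sketch compression and median-based decompression of a gradient vector. The function names and the (width, depth) parameterization are our own illustration, not the authors' code; note that all clients must share the same hash buckets and signs so that their sketches can be summed.

    import numpy as np

    def make_count_sketch(d, width, depth, seed=0):
        # Shared hash buckets h_j(i) and random signs s_j(i)
        # for a depth x width Count Sketch table.
        rng = np.random.default_rng(seed)
        buckets = rng.integers(0, width, size=(depth, d))
        signs = rng.choice([-1.0, 1.0], size=(depth, d))
        return buckets, signs

    def sketch(g, buckets, signs, width):
        # Compress a d-dimensional vector g into a depth x width table.
        S = np.zeros((buckets.shape[0], width))
        for j in range(buckets.shape[0]):
            np.add.at(S[j], buckets[j], signs[j] * g)
        return S

    def unsketch(S, buckets, signs):
        # Estimate each coordinate of g as the median over the depth rows.
        rows = np.arange(S.shape[0])[:, None]
        return np.median(signs * S[rows, buckets], axis=0)

Because the sketch is linear in g, a server can average the clients' sketches and call unsketch only once per round.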
Researcher Affiliation: Collaboration
LLM Response: Mayank Shrivastava, University of Illinois Urbana-Champaign (mayanks4@illinois.edu); Berivan Isik, Google (berivan@google.com); Qiaobo Li, University of Illinois Urbana-Champaign (qiaobol2@illinois.edu); Sanmi Koyejo, Stanford University (sanmi@cs.stanford.edu); Arindam Banerjee, University of Illinois Urbana-Champaign (arindamb@illinois.edu)
Pseudocode: Yes
LLM Response: Algorithm 1 (Sketching-Based Distributed Learning). Hyperparameters: server learning rate η_global, local learning rate η_local. Inputs: local datasets D_c of size n_c for clients c = 1, ..., C; number of communication rounds T. Output: final model θ_T.
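Below is a schematic but runnable rendering of the communication pattern such an algorithm implies, reusing the Count Sketch helpers above on a toy quadratic objective. The Client class, the choice of model deltas as the sketched quantity, and all hyperparameter values are illustrative assumptions, not the paper's Algorithm 1 verbatim.

    class Client:
        # Toy objective f_c(theta) = 0.5 * ||theta - mu_c||^2, whose
        # gradient is (theta - mu_c); stands in for a local dataset D_c.
        def __init__(self, mu):
            self.mu = mu

        def local_gd(self, theta, eta_local, steps):
            theta = theta.copy()
            for _ in range(steps):
                theta -= eta_local * (theta - self.mu)
            return theta

    def run_rounds(theta, clients, T, eta_local, eta_global,
                   sk, unsk, local_steps=5):
        for _ in range(T):
            # Each client sketches its local model delta; linearity lets
            # the server average in sketch space and unsketch once.
            avg = sum(sk(c.local_gd(theta, eta_local, local_steps) - theta)
                      for c in clients) / len(clients)
            theta = theta + eta_global * unsk(avg)
        return theta

A usage example with the helpers defined earlier:

    rng = np.random.default_rng(1)
    d, width, depth = 1_000, 200, 5
    buckets, signs = make_count_sketch(d, width, depth)
    sk = lambda v: sketch(v, buckets, signs, width)
    unsk = lambda S: unsketch(S, buckets, signs)
    clients = [Client(rng.standard_normal(d)) for _ in range(10)]
    theta_T = run_rounds(np.zeros(d), clients, T=50,
                         eta_local=0.1, eta_global=1.0, sk=sk, unsk=unsk)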
Open Source Code: No
LLM Response: Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in the supplemental material? Answer: [No] Justification: Due to confidentiality constraints, we are unable to share the code at this time, but we provide sufficient details to reproduce the results.
Open Datasets: Yes
LLM Response: We train ResNet-18 [28] on the CIFAR-10 dataset [38] that is i.i.d. distributed to 100 clients.
Dataset Splits: No
LLM Response: The paper states "We train ResNet-18 [28] on CIFAR-10 dataset [38]..." but does not explicitly provide the training/validation/test splits or percentages. While CIFAR-10 has standard splits, the paper does not specify how they were used, noting only that "Each client performs 5 local gradient descent iterations (i.e., using full-batch of size 500) at every round."
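The per-client split can be partially reconstructed from the stated numbers: CIFAR-10's standard training set has 50,000 images, so an i.i.d. split across 100 clients gives 500 images per client, consistent with the quoted full batch of size 500. A plausible, but unconfirmed, reconstruction of the partition:

    import numpy as np

    def iid_partition(num_examples=50_000, num_clients=100, seed=0):
        # Shuffle the training indices once and deal them out evenly.
        rng = np.random.default_rng(seed)
        return np.array_split(rng.permutation(num_examples), num_clients)

    shards = iid_partition()
    assert all(len(s) == 500 for s in shards)  # 50,000 / 100 clients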
Hardware Specification: Yes
LLM Response: We conducted our experiments on NVIDIA Titan X GPUs on an internal cluster server, using one GPU per run.
Software Dependencies: No
LLM Response: The paper does not provide specific version numbers for software dependencies used in its experiments, such as programming languages or deep learning frameworks (e.g., Python, PyTorch, TensorFlow versions). While it mentions PyHessian in Appendix G, it does not specify a version or list other software with versions for the main experimental setup.
Experiment Setup: Yes
LLM Response: We train ResNet-18 [28] on the CIFAR-10 dataset [38] that is i.i.d. distributed to 100 clients. Each client performs 5 local gradient descent iterations (i.e., using a full batch of size 500) at every round. ... We use a learning rate of 1e-3, SGD as the optimizer, and perform GD.
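As a sanity check on the quoted setup, here is a minimal PyTorch rendering of one client's local routine: five full-batch gradient descent steps with plain SGD at learning rate 1e-3. The function name and signature are our own assumptions; the paper does not publish code.

    import torch

    def local_full_batch_gd(model, loss_fn, x, y, lr=1e-3, steps=5):
        # x and y hold the client's entire 500-example shard, so each
        # optimizer step is one full-batch gradient descent iteration.
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
        return model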