Towards a Better Theoretical Understanding of Independent Subnetwork Training
Authors: Egor Shulgin, Peter Richtárik
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we take a closer theoretical look at Independent Subnetwork Training (IST), which is a recently proposed and highly effective technique for solving the aforementioned problems. We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication, and provide a precise analysis of its optimization performance on a quadratic model. ... Empirical validation of the proposed theory through experiments for several practical settings. |
| Researcher Affiliation | Academia | Egor Shulgin¹, Peter Richtárik¹ (¹King Abdullah University of Science and Technology, Thuwal, Saudi Arabia). Correspondence to: Egor Shulgin <egor.shulgin@kaust.edu.sa>. |
| Pseudocode | Yes | Algorithm 1 Distributed Submodel (Stochastic) Gradient Descent |
| Open Source Code | No | The paper does not provide an explicit statement or a link to open-source code for the methodology described. |
| Open Datasets | Yes | ResNet-50 model (He et al., 2016) pre-trained on ImageNet is used as a feature extractor and concatenated with two fully connected layers. The resulting model is then trained on the CIFAR-10 (Krizhevsky et al., 2009) dataset. *(see the model sketch after the table)* |
| Dataset Splits | No | The paper mentions using the CIFAR-10 dataset but does not specify how it was split into training, validation, and test sets, or any explicit percentages or counts for these splits. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used for running experiments. It mentions 'computing nodes' but provides no specific details such as GPU models, CPU types, or memory. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as Python versions or library versions (e.g., PyTorch 1.9). |
| Experiment Setup | Yes | Specifically, we consider a quadratic problem defined in (10), where L_i = B_i^T B_i. Entries of the matrices B_i ∈ R^{d×d}, vectors b_i ∈ R^d, and initialization x^0 ∈ R^d are generated from a standard Gaussian distribution N(0, 1). ... We fix the dimension d to 1000 and the number of computing nodes n to 10. ... In Figure 1(b), we demonstrate the convergence of the iterates x^k for a homogeneous problem with d = n = 50. ... Namely, we consider Algorithm 1 with 1) C_i chosen as Perm-q (6) for IST and 2) C_i = I for Distributed Gradient Descent (DGD). Both methods are implemented across n = 10 nodes, employing constant step sizes γ, and one local step per communication round. ... For even smaller step size values (γ ∈ {0.01, 0.02}), the method converges to a higher error floor. Interestingly, if γ is decreased by 10 every 1000 iterations, the method's performance (red dotted curve) almost does not change. |
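
The quadratic experiment described in the "Experiment Setup" row can be made concrete with a minimal NumPy sketch: Algorithm 1 (Distributed Submodel Gradient Descent) run with Perm-q-style disjoint coordinate masks for IST and with C_i = I for DGD, on f_i(x) = ½ x^T L_i x − b_i^T x with L_i = B_i^T B_i and standard Gaussian data. The unscaled masks, the 1/n aggregation, the step size, and the error metric are illustrative assumptions; the paper's exact Perm-q definition (its Eq. (6)) and plotted quantities may differ.

```python
# Minimal sketch (assumptions noted above) of the quadratic IST experiment:
# Algorithm 1 with Perm-q-style masks (IST) vs. identity masks (DGD) on
# f_i(x) = 0.5 * x^T L_i x - b_i^T x, where L_i = B_i^T B_i with Gaussian entries.
import numpy as np

rng = np.random.default_rng(0)
d, n = 1000, 10                  # dimension and number of computing nodes (as in the paper)
gamma, iters = 1e-4, 2000        # illustrative step size / iteration budget, not the paper's

# Standard Gaussian problem data and initialization.
Bs = [rng.standard_normal((d, d)) for _ in range(n)]
Ls = [B.T @ B for B in Bs]       # L_i = B_i^T B_i (positive semidefinite)
bs = [rng.standard_normal(d) for _ in range(n)]
x0 = rng.standard_normal(d)

# Minimizer of the averaged objective, used here only to measure the error floor.
x_star = np.linalg.solve(sum(Ls), sum(bs))

def perm_q_masks(d, n, rng):
    """Disjoint 0/1 coordinate masks from one random permutation (Perm-q style, unscaled)."""
    perm = rng.permutation(d)
    masks = np.zeros((n, d))
    for i, block in enumerate(np.array_split(perm, n)):
        masks[i, block] = 1.0
    return masks

def run(use_perm_q, decay_every=None):
    """Algorithm 1 with one local step per round; aggregation by averaging (assumption)."""
    x, g = x0.copy(), gamma
    errors = []
    for k in range(iters):
        if decay_every and k > 0 and k % decay_every == 0:
            g /= 10              # "decreased by 10 every 1000 iterations" variant
        masks = perm_q_masks(d, n, rng) if use_perm_q else np.ones((n, d))
        update = np.zeros(d)
        for i in range(n):
            x_sub = masks[i] * x                 # node i only receives its submodel C_i x
            grad = Ls[i] @ x_sub - bs[i]         # local gradient at the sketched point
            update += masks[i] * grad            # node i only updates its own coordinates
        x -= g * update / n
        errors.append(np.linalg.norm(x - x_star) ** 2)
    return errors

ist_err = run(use_perm_q=True)     # IST: Perm-q submodel masks
dgd_err = run(use_perm_q=False)    # DGD: C_i = I (every node holds the full model)
```

Plotting `ist_err` against `dgd_err` on a log scale reproduces the qualitative picture the quote describes: DGD drives the error down, while the sketched method settles at a step-size-dependent error floor.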
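The neural-network experiment in the "Open Datasets" row pairs an ImageNet-pretrained ResNet-50 feature extractor with two fully connected layers, trained on CIFAR-10. A minimal PyTorch sketch of that construction is below; the hidden width, the choice to freeze the backbone, and the input transforms are assumptions, since the quote does not specify them.

```python
# Hedged sketch of the CIFAR-10 model described in the "Open Datasets" row:
# an ImageNet-pretrained ResNet-50 feature extractor followed by two fully
# connected layers. Hidden width (512), freezing the backbone, and the input
# transforms are assumptions for illustration, not the paper's stated choices.
import torch.nn as nn
import torchvision
from torchvision import transforms

backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = nn.Identity()                 # expose the 2048-dim pooled features
for p in backbone.parameters():
    p.requires_grad = False                 # use the backbone as a fixed feature extractor

head = nn.Sequential(                       # the "two fully connected layers"
    nn.Linear(2048, 512),
    nn.ReLU(),
    nn.Linear(512, 10),                     # CIFAR-10 has 10 classes
)
model = nn.Sequential(backbone, head)

transform = transforms.Compose([
    transforms.Resize(224),                 # ResNet-50 expects ImageNet-sized inputs
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True, download=True,
                                         transform=transform)
```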