Towards a Better Theoretical Understanding of Independent Subnetwork Training

Authors: Egor Shulgin, Peter Richtárik

ICML 2024

Reproducibility variables, assessed results, and supporting LLM responses:
Research Type: Experimental
LLM Response: In this work, we take a closer theoretical look at Independent Subnetwork Training (IST), which is a recently proposed and highly effective technique for solving the aforementioned problems. We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication, and provide a precise analysis of its optimization performance on a quadratic model. ... Empirical validation of the proposed theory through experiments for several practical settings.

Researcher Affiliation: Academia
LLM Response: Egor Shulgin, Peter Richtárik; King Abdullah University of Science and Technology, Thuwal, Saudi Arabia. Correspondence to: Egor Shulgin <egor.shulgin@kaust.edu.sa>.

Pseudocode: Yes
LLM Response: Algorithm 1, Distributed Submodel (Stochastic) Gradient Descent.

Open Source Code: No
LLM Response: The paper does not provide an explicit statement or a link to open-source code for the methodology described.

Open Datasets: Yes
LLM Response: A ResNet-50 model (He et al., 2016) pre-trained on ImageNet is used as a feature extractor and concatenated with two fully connected layers. The resulting model is then trained on the CIFAR-10 (Krizhevsky et al., 2009) dataset. (A hedged sketch of this architecture appears after the table.)

Dataset Splits: No
LLM Response: The paper mentions using the CIFAR-10 dataset but does not specify how it was split into training, validation, and test sets, or any explicit percentages or counts for these splits.

Hardware Specification: No
LLM Response: The paper does not explicitly describe the hardware used for running experiments. It mentions 'computing nodes' but provides no specific details such as GPU models, CPU types, or memory.

Software Dependencies: No
LLM Response: The paper does not provide specific software dependencies with version numbers, such as Python versions or library versions (e.g., PyTorch 1.9).

Experiment Setup: Yes
LLM Response: Specifically, we consider a quadratic problem defined in (10), where L_i = B_i^T B_i. Entries of the matrices B_i ∈ R^{d×d}, vectors b_i ∈ R^d, and the initialization x_0 ∈ R^d are generated from a standard Gaussian distribution N(0, 1). ... We fix the dimension d to 1000 and the number of computing nodes n to 10. ... In Figure 1(b), we demonstrate the convergence of the iterates x_k for a homogeneous problem with d = n = 50. ... Namely, we consider Algorithm 1 with 1) C_i chosen as Perm-q (6) for IST and 2) C_i = I for Distributed Gradient Descent (DGD). Both methods are implemented across n = 10 nodes, employing constant step sizes γ, and one local step per communication round. ... For even smaller step size values (γ ∈ {0.01, 0.02}), the method converges to a higher error floor. Interestingly, if γ is decreased by a factor of 10 every 1000 iterations, the method's performance (red dotted curve) almost does not change.
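
To make the Experiment Setup row concrete, the following is a minimal NumPy sketch of the quadratic experiment in the spirit of Algorithm 1 (Distributed Submodel Gradient Descent). Since the paper's code is not released, the local objective f_i(x) = 0.5 x^T L_i x - b_i^T x, the modeling of Perm-q as disjoint random coordinate blocks without rescaling, and the chosen constants are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hedged sketch of the quadratic IST experiment (assumed details, see above).
rng = np.random.default_rng(0)
d, n = 100, 10             # the paper uses d = 1000, n = 10; smaller here for speed
gamma, iters = 1e-4, 2000  # conservative toy step size, not one of the paper's values

B = rng.standard_normal((n, d, d))
L = np.einsum("nij,nik->njk", B, B)   # L_i = B_i^T B_i
b = rng.standard_normal((n, d))
x = rng.standard_normal(d)

def grad(i, x):
    """Gradient of the assumed i-th local quadratic at x."""
    return L[i] @ x - b[i]

for _ in range(iters):
    perm = rng.permutation(d)
    blocks = np.array_split(perm, n)      # disjoint submodels (Perm-q style)
    update = np.zeros(d)
    for i in range(n):
        mask = np.zeros(d)
        mask[blocks[i]] = 1.0
        g_i = grad(i, mask * x)           # node i works on its sparsified iterate
        update += mask * g_i              # and only updates its own coordinates
    x -= gamma * update                   # server applies the aggregated sparse update

# DGD baseline (C_i = I): replace the inner loop with
#   update = sum(grad(i, x) for i in range(n))
print("full-gradient norm after IST-style run:",
      np.linalg.norm(sum(grad(i, x) for i in range(n))))
```

In this sketch the blocks are disjoint, so each coordinate is touched by exactly one node per round; the resulting sparse update is what separates this submodel scheme from compressed-gradient baselines in which every node contributes to every coordinate.

The Open Datasets row describes a ResNet-50 feature extractor topped with two fully connected layers and trained on CIFAR-10. Below is a minimal PyTorch/torchvision sketch of such a model; the hidden width of 512, the frozen backbone, and the preprocessing are assumptions made for illustration, not details reported in the paper.

```python
import torch.nn as nn
from torchvision import models, datasets, transforms

# Hedged sketch of the described model; width, freezing, and transforms are assumed.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = nn.Identity()            # use ResNet-50 purely as a feature extractor
for p in backbone.parameters():
    p.requires_grad = False

head = nn.Sequential(                  # two fully connected layers on top
    nn.Linear(2048, 512),
    nn.ReLU(),
    nn.Linear(512, 10),                # CIFAR-10 has 10 classes
)
model = nn.Sequential(backbone, head)

transform = transforms.Compose([
    transforms.Resize(224),            # match the ImageNet input resolution
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])
train_set = datasets.CIFAR10(root="./data", train=True, download=True,
                             transform=transform)
```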
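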