An improved analysis of per-sample and per-update clipping in federated learning

Authors: Bo Li, Xiaowen Jiang, Mikkel N. Schmidt, Tommy Sonne Alstrøm, Sebastian U Stich

ICLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We experimentally validate our theoretical statements under different levels of data heterogeneity. As the effect of stochastic noise has been thoroughly evaluated experimentally by Koloskova et al. (2023), we mainly focus here on demonstrating the impact of the data heterogeneity ζ² on convergence. Our main findings are: 1) when the data heterogeneity is low, per-sample and per-update clipping have similar convergence behaviour; 2) when the data heterogeneity is high, per-sample clipping converges only to a neighbourhood of the target accuracy, but this neighbourhood shrinks as the clipping threshold c increases; 3) per-update clipping can reach the target accuracy at the cost of more communication rounds. See Appendix G for the experimental setup and Appendix I for a more complex NN experiment on the CIFAR10 dataset. Experimental results: We tune the stepsize for all experiments to reach the desired target accuracy ε := ‖∇f(x_t)‖ with the fewest rounds. We show the number of communication rounds required to reach the target accuracy ε = 0.18 using per-sample clipping in Fig. 1 and per-update clipping in Fig. 2.
Researcher Affiliation Academia Bo Li (Technical University of Denmark) blia@dtu.dk; Xiaowen Jiang (CISPA Helmholtz Center for Information Security) xiaowen.jiang@cispa.de; Mikkel N. Schmidt (DTU) mnsc@dtu.dk; Tommy S. Alstrøm (DTU) tsal@dtu.dk; Sebastian U. Stich (CISPA) stich@cispa.de
Pseudocode Yes Algorithm 1 Per-sample clipping ... Algorithm 2 Per-update clipping
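The two clipping rules named in the pseudocode row above both apply the generic operator clip(v, c) = v · min(1, c/‖v‖); they differ only in *what* is clipped (each per-sample gradient before averaging vs. the whole local update before communication). A minimal NumPy sketch of this distinction, assuming these function names and the standard clipping operator (this is an illustration, not the authors' implementation):

```python
import numpy as np

def clip(v, c):
    """Scale v so its Euclidean norm is at most c: v * min(1, c / ||v||)."""
    norm = np.linalg.norm(v)
    return v if norm <= c else v * (c / norm)

def per_sample_step(per_sample_grads, c, lr):
    """Per-sample clipping: clip each sample's gradient, then average
    and take one gradient step. Returns the model delta."""
    g = np.mean([clip(g, c) for g in per_sample_grads], axis=0)
    return -lr * g

def per_update_clip(local_update, c):
    """Per-update clipping: clip the accumulated local update
    (e.g., the sum of tau local steps) once, before communication."""
    return clip(local_update, c)
```

Note that per-sample clipping biases the *average* gradient (each summand is clipped), whereas per-update clipping only rescales the transmitted update, which matches the paper's finding that the two behave differently under high heterogeneity.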
Open Source Code No We refer to Appendix G for the implementation of the two clipping methods as well as the hyperparameter choices and software.
Open Datasets Yes We illustrate the performance using multinomial logistic regression (Greene, 2003) on the MNIST dataset (LeCun & Cortes, 2010). [...] We use a simple deep neural network with two convolution layers (32 and 64 channels) and two fully connected layers (hidden dimension 512) on CIFAR10 for classification.
Dataset Splits Yes We illustrate the performance using multinomial logistic regression (Greene, 2003) on the MNIST dataset (LeCun & Cortes, 2010). We use ten workers with full participation. We randomly subsample 1024 images into each worker to use full-batch gradients (σ = 0). We vary the number of classes in each worker to simulate different levels of data heterogeneity following Hsu et al. (2019). [...] We split the CIFAR10 training data into 10 clients following a Dirichlet distribution (Kairouz et al., 2019) with concentration parameter 0.1.
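The Dirichlet partitioning quoted above is a standard recipe: for each class, draw per-client proportions from Dir(α) and split that class's indices accordingly (small α, such as the paper's 0.1, yields highly heterogeneous clients). A sketch under those assumptions; the function name and signature are illustrative:

```python
import numpy as np

def dirichlet_split(labels, n_clients=10, alpha=0.1, seed=0):
    """Partition sample indices across clients, with each class's share
    per client drawn from a Dirichlet(alpha) distribution."""
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for k in np.unique(labels):
        idx_k = np.where(labels == k)[0]
        rng.shuffle(idx_k)
        props = rng.dirichlet(alpha * np.ones(n_clients))
        # Cut points inside idx_k proportional to the sampled shares.
        cuts = (np.cumsum(props)[:-1] * len(idx_k)).astype(int)
        for cid, part in enumerate(np.split(idx_k, cuts)):
            client_idx[cid].extend(part.tolist())
    return client_idx
```

Every index is assigned to exactly one client, so the client datasets form a disjoint cover of the training set.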
Hardware Specification No The paper does not provide specific details about the hardware used for running the experiments, such as GPU/CPU models or memory.
Software Dependencies Yes We implement all the models with PyTorch 1.7.1 and Python 3.7.9.
Experiment Setup Yes We illustrate the performance using multinomial logistic regression (Greene, 2003) on the MNIST dataset (LeCun & Cortes, 2010). [...] We tune the stepsize for all experiments to reach the desired target accuracy ε := ‖∇f(x_t)‖ with the fewest rounds. [...] For per-update clipping, we experiment with stepsizes from {0.00625, 0.0125, 0.025, 0.05, 0.1, 0.2}. For per-sample clipping, we experiment with stepsizes from {0.1, 0.2, 0.4, 0.8}. [...] We use τ = 10 local steps and a batch size of 1024.
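The tuning protocol in the row above (for each stepsize in a grid, record the first round where the gradient norm drops to ε, and keep the fastest) can be sketched on a toy one-dimensional quadratic. The objective, starting point, and use of the paper's per-sample grid here are illustrative assumptions, not the paper's actual training setup:

```python
def rounds_to_target(lr, eps=0.18, x0=10.0, max_rounds=10_000):
    """Gradient-descent rounds on the toy objective f(x) = x^2 / 2
    (so f'(x) = x) until |f'(x)| <= eps; None if it never gets there."""
    x = x0
    for t in range(max_rounds):
        if abs(x) <= eps:  # gradient norm has reached the target
            return t
        x -= lr * x
    return None

# Grid from the per-sample experiments; pick the stepsize that
# reaches the target in the fewest rounds.
grid = [0.1, 0.2, 0.4, 0.8]
rounds = {lr: rounds_to_target(lr) for lr in grid}
best = min(grid, key=lambda lr: rounds[lr] if rounds[lr] is not None else float("inf"))
```

On this toy problem every stepsize in the grid converges and the largest one wins; in the paper's setting the clipping threshold and heterogeneity change which stepsize is feasible, which is why the tuning is repeated per configuration.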