Fast Stochastic Bregman Gradient Methods: Sharp Analysis and Variance Reduction

Authors: Radu Alexandru Dragomir, Mathieu Even, Hadrien Hendrikx

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We illustrate the effectiveness of our approach on two key applications of relative smoothness: tomographic reconstruction with Poisson noise and statistical preconditioning for distributed optimization. ... 5. Experiments: In order to show the effectiveness of our method, we consider the two key settings mentioned in the introduction: problems with unbounded curvature (inverse problems with Poisson noise) and preconditioned distributed optimization.
Researcher Affiliation | Academia | 1Université Toulouse 1 Capitole, 2D.I. École Normale Supérieure, CNRS, PSL University, Paris, 3INRIA Paris.
Pseudocode | Yes | Algorithm 1 Bregman-SAGA((η_t)_{t≥0}, x_0). A hedged sketch of one such update appears below the table.
Open Source Code | No | The paper does not provide any statement or link indicating that the source code for their methodology is publicly available.
Open Datasets | Yes | We use the log-barrier reference function, h(x) = −Σ_i log(x_i), for which relative smoothness holds with L_{f/h} = Σ_{i=1}^n b_i / n (Bauschke et al., 2017). ... We solve a logistic regression problem for the RCV1 dataset (Lewis et al., 2004).
Dataset Splits | No | The paper mentions using specific datasets but does not provide details on training, validation, or test splits, such as percentages, sample counts, or explicit instructions for partitioning the data.
Hardware Specification | No | The paper does not provide any specific hardware details (e.g., CPU/GPU models, memory, or cloud instances) used for running the experiments.
Software Dependencies | No | The paper discusses algorithms and concepts but does not list any specific software components with version numbers (e.g., Python, PyTorch, specific libraries, or solvers) used for implementation or experimentation.
Experiment Setup | Yes | A fixed learning rate is used, and the best one is selected among [0.025, 0.05, 0.1, 0.25, 0.5, 1.]. BGD uses η = 0.5 while SAGA and BSGD use η = 0.05. The x-axis represents the total number of communications (or number of passes over the dataset). Note that at each epoch, BGD communicates once with all workers (one round trip for each worker) whereas BSGD and BSAGA communicate n times with one worker sampled uniformly at random each time. ... Regularization is taken as λ = 10^-5, and there are n = 100 nodes with N = 1000 samples each. The step-size grid selection is sketched below the table.
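
The Pseudocode and Open Datasets rows quote Algorithm 1 (Bregman-SAGA) and the log-barrier reference function with its relative-smoothness constant. The following is a minimal Python sketch of a SAGA-style Bregman gradient loop under that reference function; the function names (bregman_saga_sketch, grad_fns) and the plain mirror step are illustrative assumptions, not the paper's Algorithm 1 verbatim.

import numpy as np

def bregman_saga_sketch(grad_fns, x0, eta, n_epochs, seed=0):
    # Minimal sketch of a SAGA-style Bregman gradient loop with the
    # log-barrier (Burg entropy) reference h(x) = -sum_i log(x_i).
    # grad_fns: list of per-component gradient oracles; all names here
    # are illustrative assumptions, not code from the paper.
    rng = np.random.default_rng(seed)
    n = len(grad_fns)
    x = np.asarray(x0, dtype=float).copy()
    table = np.stack([g(x) for g in grad_fns])   # stored per-component gradients
    avg = table.mean(axis=0)                     # running average of the table
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        g_i = grad_fns[i](x)
        v = g_i - table[i] + avg                 # SAGA variance-reduced estimate
        # Bregman step: grad h(x+) = grad h(x) - eta * v, with grad h(x) = -1/x,
        # so componentwise 1/x+ = 1/x + eta * v (the RHS must stay positive).
        x = 1.0 / (1.0 / x + eta * v)
        avg += (g_i - table[i]) / n              # keep the average in sync
        table[i] = g_i
    return x

With the log-barrier, the mirror step is closed-form because ∇h(x) = −1/x; the update is well defined whenever 1/x + ηv stays componentwise positive, which is the regime a step size on the order of 1/L_{f/h} is meant to keep the iterates in for the Poisson setting quoted above.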
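The Experiment Setup row describes selecting a fixed step size from a grid. A small hypothetical helper illustrating that selection (run_trial is an assumed callable returning the final error of a method for a given step size, not something from the paper):

STEP_GRID = [0.025, 0.05, 0.1, 0.25, 0.5, 1.0]

def select_step_size(run_trial, method):
    # Return the step size from the quoted grid with the lowest final error.
    errors = {eta: run_trial(method, eta) for eta in STEP_GRID}
    return min(errors, key=errors.get)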