Deep learning with Elastic Averaging SGD

Authors: Sixin Zhang, Anna E. Choromanska, Yann LeCun

NeurIPS 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to DOWNPOUR and other common baseline approaches and furthermore is very communication efficient. |
| Researcher Affiliation | Collaboration | Sixin Zhang (Courant Institute, NYU, zsx@cims.nyu.edu); Anna Choromanska (Courant Institute, NYU, achoroma@cims.nyu.edu); Yann LeCun (Center for Data Science, NYU & Facebook AI Research, yann@cims.nyu.edu) |
| Pseudocode | Yes | Algorithm 1: Asynchronous EASGD: Processing by worker i and the master |
| Open Source Code | Yes | Our implementation is available at https://github.com/sixin-zh/mpiT. |
| Open Datasets | Yes | We perform experiments in a deep learning setting on two benchmark datasets: CIFAR-10 (we refer to it as CIFAR) [4] and ImageNet ILSVRC 2013 (we refer to it as ImageNet) [5]. (...) [4] Downloaded from http://www.cs.toronto.edu/~kriz/cifar.html. [5] Downloaded from http://image-net.org/challenges/LSVRC/2013. |
| Dataset Splits | No | The paper mentions using the CIFAR and ImageNet datasets but does not explicitly specify the training/validation/test splits (e.g., percentages or counts) in the main text. While these datasets have standard splits, the paper does not state how it utilized them for training, validation, and testing. |
| Hardware Specification | Yes | For all our experiments we use a GPU-cluster interconnected with InfiniBand. Each node has 4 Titan GPU processors where each local worker corresponds to one GPU processor. |
| Software Dependencies | No | The paper does not provide a reproducible description of ancillary software with specific version numbers. It mentions deep learning frameworks but gives no versions for its own implementation. |
| Experiment Setup | Yes | We add ℓ2-regularization (λ/2)‖x‖² to the loss function F(x). For ImageNet we use λ = 10^-5 and for CIFAR we use λ = 10^-4. We also compute the stochastic gradient using mini-batches of sample size 128. (...) For all experiments in this section we use EASGD with β = 0.98, for all momentum-based methods we set the momentum term δ = 0.99, and finally for MVADOWNPOUR we set the moving rate to α = 0.001. |
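
The pseudocode row above refers to Algorithm 1 (asynchronous EASGD), in which each worker runs local SGD and, every τ steps, exchanges an elastic difference with a master-held center variable. The single-process Python sketch below illustrates that update rule on a toy quadratic objective; the round-robin scheduling, the toy loss, and all names are assumptions for illustration only, not the paper's MPI/GPU implementation, and the hyperparameter values are simply the ones quoted in the setup row.

```python
import numpy as np

# Single-process sketch of the asynchronous EASGD round (Algorithm 1).
# Illustration only: a toy quadratic loss and round-robin "workers"
# stand in for the paper's one-GPU-per-worker MPI processes.

rng = np.random.default_rng(0)

dim, p = 10, 4                   # parameter dimension, number of local workers
eta, tau = 0.01, 4               # learning rate, communication period
beta = 0.98                      # elastic parameter quoted in the setup row above
alpha = beta / p                 # moving rate (the paper couples beta = p * alpha)
lam = 1e-4                       # l2-regularization (the CIFAR value quoted above)

x_star = rng.normal(size=dim)    # optimum of the toy objective

def stochastic_grad(x):
    """Noisy gradient of 0.5*||x - x_star||^2 plus the l2 penalty."""
    return (x - x_star) + lam * x + 0.1 * rng.normal(size=dim)

x_tilde = np.zeros(dim)                       # center (master) variable
workers = [np.zeros(dim) for _ in range(p)]   # local variables x_i
clocks = [0] * p                              # local clocks t_i

for step in range(4000):
    i = step % p                              # round-robin stand-in for asynchrony
    x = workers[i].copy()                     # snapshot used for the gradient
    if clocks[i] % tau == 0:                  # elastic exchange with the master
        diff = alpha * (x - x_tilde)
        workers[i] = workers[i] - diff        # worker is pulled toward the center
        x_tilde = x_tilde + diff              # center is pulled toward the worker
    workers[i] = workers[i] - eta * stochastic_grad(x)   # local SGD step
    clocks[i] += 1

print("center-to-optimum distance:", np.linalg.norm(x_tilde - x_star))
```

In the paper's actual setup each local worker is one Titan GPU process exchanging these updates with the master over the InfiniBand-connected cluster, and the workers run truly asynchronously rather than in a fixed order.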
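
The setup row also quotes a momentum term δ = 0.99 for the momentum-based methods. As a hypothetical illustration of where such a term enters, the sketch below applies a Nesterov-style momentum step to the local worker and keeps the same elastic pull toward the center variable; the exact coupling and all names are assumptions (consult the paper's EAMSGD/DOWNPOUR variants for the precise rules), and the elastic term is applied every step here only for brevity.

```python
import numpy as np

# Hypothetical momentum-based local worker step: a Nesterov-style
# look-ahead gradient with momentum delta = 0.99 (the value quoted above),
# combined with an elastic pull toward the center variable x_tilde.
# The precise EAMSGD update in the paper may differ; this is a sketch.

delta, eta, alpha = 0.99, 0.01, 0.245   # momentum, learning rate, moving rate

def momentum_worker_step(x, v, x_tilde, grad_fn):
    """One local update: momentum step plus elastic attraction to x_tilde."""
    v_new = delta * v - eta * grad_fn(x + delta * v)   # look-ahead gradient
    x_new = x + v_new - alpha * (x - x_tilde)          # elastic attraction
    return x_new, v_new

# Toy usage on a quadratic objective centered at the origin.
grad = lambda x: x
x, v, x_tilde = np.ones(5), np.zeros(5), np.zeros(5)
for _ in range(200):
    x, v = momentum_worker_step(x, v, x_tilde, grad)
print("worker norm after 200 steps:", np.linalg.norm(x))   # shrinks toward 0
```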