Evolutionary Stochastic Gradient Descent for Optimization of Deep Neural Networks
Authors: Xiaodong Cui, Wei Zhang, Zoltán Tüske, Michael Picheny
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The effectiveness of ESGD is demonstrated across multiple applications including speech recognition, image recognition and language modeling, using networks with a variety of deep architectures. We evaluate the performance of the proposed ESGD on large vocabulary continuous speech recognition (LVCSR), image recognition and language modeling. We compare ESGD with two baseline systems. |
| Researcher Affiliation | Industry | Xiaodong Cui, Wei Zhang, Zoltán Tüske and Michael Picheny IBM Research AI IBM T. J. Watson Research Center Yorktown Heights, NY 10598, USA {cuix, weiz, picheny}@us.ibm.com, {Zoltan.Tuske}@ibm.com |
| Pseudocode | Yes | Algorithm 1: Evolutionary Stochastic Gradient Descent (ESGD) (an illustrative sketch of the population/SGD loop this algorithm describes appears after the table) |
| Open Source Code | No | The paper does not provide a statement or link to open-source code for the described methodology. |
| Open Datasets | Yes | BN50: The 50-hour Broadcast News corpus is a widely used dataset for speech recognition [31]; the 50-hour data consists of a 45-hour training set and a 5-hour validation set. SWB300: The 300-hour Switchboard dataset is another widely used dataset in speech recognition [31]. The CIFAR10 dataset [35] is a widely used image recognition benchmark containing a 50K-image training set and a 10K-image test set. The evaluation of the ESGD algorithm is also carried out on the standard Penn Treebank (PTB) language modeling dataset [38]. |
| Dataset Splits | Yes | The 50-hour data consists of a 45-hour training set and a 5-hour validation set. The learning rate is annealed by 2x every time the loss on the validation set of the current epoch is worse than the previous epoch, and meanwhile the model is backed off to the previous epoch. Note that CIFAR10 does not include a validation set. To be consistent with the training set used in the literature, we do not split a validation set from the training set. |
| Hardware Specification | No | The paper mentions 'All the SGD updates and fitness evaluation are carried out in parallel on a set of GPUs.' but does not specify particular GPU models or other hardware components. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9). |
| Experiment Setup | Yes | The single baseline is trained using SGD with a batch size of 128 without momentum for 20 epochs. The initial learning rate is 0.001 for BN50 and 0.025 for SWB300. The learning rate is annealed by 2x every time the loss on the validation set of the current epoch is worse than the previous epoch, and meanwhile the model is backed off to the previous epoch (an illustrative sketch of this anneal-and-back-off schedule appears after the table). The population sizes for both the baseline and ESGD are 100. The offspring population of ESGD consists of 400 individuals. In ESGD, after 15 generations (Ks = 1), a 5-epoch fine-tuning is applied to each individual with a small learning rate. For the single-run CIFAR10 baseline, we follow the recipes proposed in [37], in which the initial learning rate is 0.1 and is annealed by 10x after 81 epochs and by another 10x at epoch 122. Training finishes in 160 epochs. The model is trained by SGD with Nesterov acceleration and a momentum of 0.9. |
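
The Pseudocode row above refers to the paper's Algorithm 1. As a rough illustration of the loop that algorithm describes, which alternates SGD updates of every individual with an evolution step that generates offspring and selects by fitness, here is a minimal, self-contained Python sketch on a toy quadratic problem. The population size (100), offspring size (400), and generation count (15) come from the quoted setup; the toy loss, mutation scale, recombination, and selection rule are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of an ESGD-style loop on a toy quadratic problem.
# Population sizes and generation count follow the setup quoted in the
# table; everything else is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
DIM = 10                 # toy parameter dimension
MU, LAMBDA = 100, 400    # parent / offspring population sizes from the table
GENERATIONS = 15         # "after 15 generations (Ks = 1)"

def loss(theta):
    """Toy stand-in for the validation loss used as fitness."""
    return float(np.sum((theta - 1.0) ** 2))

def sgd_epoch(theta, lr=0.01, steps=50):
    """Placeholder for one epoch of SGD on the training loss."""
    for _ in range(steps):
        grad = 2.0 * (theta - 1.0) + rng.normal(scale=0.1, size=DIM)  # noisy gradient
        theta = theta - lr * grad
    return theta

# Initialise the parent population.
population = [rng.normal(size=DIM) for _ in range(MU)]

for gen in range(GENERATIONS):
    # SGD phase: every individual is updated independently (in the paper
    # this is done in parallel on a set of GPUs).
    population = [sgd_epoch(theta) for theta in population]

    # Evolution phase: create offspring by recombining two parents and
    # adding Gaussian mutation, then keep the fittest MU individuals.
    offspring = []
    for _ in range(LAMBDA):
        a, b = rng.choice(MU, size=2, replace=False)
        child = 0.5 * (population[a] + population[b])   # intermediate recombination
        child += rng.normal(scale=0.05, size=DIM)       # Gaussian mutation
        offspring.append(child)

    combined = population + offspring
    combined.sort(key=loss)                             # fitness = (toy) validation loss
    population = combined[:MU]

    print(f"gen {gen:2d}  best fitness {loss(population[0]):.4f}")
```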
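
The Experiment Setup row quotes an anneal-and-back-off learning-rate schedule: whenever the current epoch's validation loss is worse than the previous epoch's, the learning rate is halved and the model is restored to the previous epoch. Below is a minimal sketch of that rule on a toy model; the toy parameters, noisy gradient, and validation loss are placeholders, not the paper's code.

```python
# Hedged sketch of the "anneal by 2x and back off" schedule quoted in
# the Experiment Setup row; the toy model and loss are illustrative.
import numpy as np

rng = np.random.default_rng(1)
theta = rng.normal(size=10)          # toy model parameters
lr = 0.001                           # initial learning rate (BN50 value from the table)

def val_loss(theta):
    return float(np.sum((theta - 1.0) ** 2))

def train_one_epoch(theta, lr):
    grad = 2.0 * (theta - 1.0) + rng.normal(scale=0.5, size=theta.shape)
    return theta - lr * grad

prev_theta, prev_loss = theta.copy(), val_loss(theta)

for epoch in range(20):              # "trained ... for 20 epochs"
    theta = train_one_epoch(theta, lr)
    cur_loss = val_loss(theta)
    if cur_loss > prev_loss:
        theta = prev_theta.copy()    # back off to the previous epoch's model
        lr /= 2.0                    # anneal the learning rate by 2x
    else:
        prev_theta, prev_loss = theta.copy(), cur_loss
    print(f"epoch {epoch:2d}  lr {lr:.6f}  val loss {cur_loss:.4f}")
```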