Escaping Saddle Points with Compressed SGD

Authors: Dmitrii Avdiukhin, Grigory Yaroslavtsev

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we show that noisy Compressed SGD achieves convergence comparable with full SGD and successfully escapes saddle points. We perform our first set of experiments on a ResNet34 model trained on the CIFAR-10 dataset with step size 0.1. We analyze convergence of compressed SGD with the RANDOMK compressor when 100%, 10%, 1%, and 0.1% of random gradient coordinates are communicated. Figure 1 shows that SGD with RANDOMK keeping 10% or 1% of coordinates converges as fast as full SGD while requiring substantially less communication. (A sketch of the RANDOMK and TOPK compressors appears after the table.)
Researcher Affiliation | Academia | Dmitrii Avdiukhin, Department of Computer Science, Indiana University, Bloomington, IN 47405, davdyukh@iu.edu; Grigory Yaroslavtsev, Department of Computer Science, George Mason University, Fairfax, VA 22030, grigory@grigory.us
Pseudocode | Yes | Algorithm 1: Compressed SGD. (A sketch of one compressed-SGD step with error feedback appears after the table.)
Open Source Code | No | The paper does not provide an unambiguous statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We perform our first set of experiments on a ResNet34 model trained on the CIFAR-10 dataset with step size 0.1. We compare uncompressed SGD, SGD with the TOPK compressor (0.1% of coordinates), and SGD with the RANDOMK compressor (0.1% of coordinates) on a deep MNIST autoencoder.
Dataset Splits | No | The paper uses the CIFAR-10 and MNIST datasets but does not explicitly provide specific training, validation, or test dataset split percentages or methodologies beyond implicitly using a test set.
Hardware Specification | No | The paper discusses distributed settings and mentions 'multiple machines' but does not specify any particular hardware details such as GPU models, CPU types, or cloud instance specifications used for the experiments.
Software Dependencies | No | The paper does not provide specific software names with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA) that would be needed to replicate the experiments.
Experiment Setup | Yes | We perform our first set of experiments on a ResNet34 model trained on the CIFAR-10 dataset with step size 0.1. We distribute the data across 10 machines, such that each machine contains data from a single class. Figure 1: Convergence of distributed SGD (η = 0.1, batch size 8 per machine) with the RANDOMK compressor... Figure 2: Convergence of SGD (η = 0.1, batch size 64)... and with Gaussian noise (green, σ = 0.01 for each coordinate).
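
The RANDOMK and TOPK compressors referenced above are standard sparsification operators: RANDOMK keeps a uniformly random fraction of the gradient coordinates, while TOPK keeps those with the largest magnitude. Below is a minimal PyTorch-style sketch of both; the function names, signatures, and the k_frac parameter are ours for illustration (the report above only specifies the fraction of coordinates kept, e.g. 10%, 1%, or 0.1%).

```python
import torch

def random_k(grad: torch.Tensor, k_frac: float = 0.01) -> torch.Tensor:
    """RANDOMK sketch: keep a uniformly random fraction k_frac of coordinates, zero the rest."""
    flat = grad.flatten()
    k = max(1, int(k_frac * flat.numel()))
    idx = torch.randperm(flat.numel(), device=flat.device)[:k]
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(grad)

def top_k(grad: torch.Tensor, k_frac: float = 0.001) -> torch.Tensor:
    """TOPK sketch: keep the fraction k_frac of coordinates with largest magnitude, zero the rest."""
    flat = grad.flatten()
    k = max(1, int(k_frac * flat.numel()))
    idx = torch.topk(flat.abs(), k).indices
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(grad)
```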
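The report only names Algorithm 1 (Compressed SGD). The following sketch shows one plausible local step with error feedback and optional Gaussian perturbation, assuming the common error-feedback formulation of compressed SGD and reusing the compressors from the sketch above; it is an illustration under those assumptions, not the paper's exact Algorithm 1.

```python
def compressed_sgd_step(param, grad, memory, lr=0.1, compressor=random_k, noise_std=0.0):
    """One local step of compressed SGD with error feedback (hypothetical sketch).

    `memory` accumulates the coordinates dropped by the compressor so that their
    mass is re-injected in later rounds; optional per-coordinate Gaussian noise
    mirrors the perturbation used to escape saddle points.
    """
    if noise_std > 0:
        grad = grad + noise_std * torch.randn_like(grad)
    corrected = grad + memory          # add back previously dropped coordinates
    message = compressor(corrected)    # sparse update that would actually be communicated
    new_memory = corrected - message   # compression error kept locally for later rounds
    new_param = param - lr * message   # SGD update using the compressed gradient
    return new_param, new_memory
```

Under this sketch, the Figure 2 setting quoted above would roughly correspond to compressor=lambda g: top_k(g, 0.001), lr=0.1, and noise_std=0.01, but the exact correspondence to the paper's procedure is an assumption.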