Training Neural Networks Without Gradients: A Scalable ADMM Approach

Authors: Gavin Taylor, Ryan Burmeister, Zheng Xu, Bharat Singh, Ankit Patel, Tom Goldstein

Venue: ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this section, we present experimental results that compare the performance of the ADMM method to other approaches, including SGD, conjugate gradients, and LBFGS on benchmark classification tasks."
Researcher Affiliation | Academia | "United States Naval Academy, Annapolis, MD USA; University of Maryland, College Park, MD USA; Rice University, Houston, TX USA"
Pseudocode | Yes | "Algorithm 1 ADMM for Neural Nets" (a hedged sketch of the algorithm's alternating updates follows this table)
Open Source Code | No | The paper does not provide any concrete access information for the source code, such as a repository link or an explicit statement about code release.
Open Datasets | Yes | "The first is a subset of the Street View House Numbers (SVHN) dataset (Netzer et al., 2011)." and "The second dataset is the far more difficult Higgs dataset (Baldi et al., 2014)."
Dataset Splits | Yes | "Using the extra dataset to train, this meant 120,290 training datapoints of 648 features each. The testing set contained 5,893 data points." and "The second dataset is the far more difficult Higgs dataset (Baldi et al., 2014), consisting of a training set of 10,500,000 datapoints of 28 features each... The testing set consists of 500,000 datapoints." (a loading/split sketch for the Higgs data follows this table)
Hardware Specification | Yes | "The new ADMM approach was implemented in Python on a Cray XC30 supercomputer with Ivy Bridge processors, and communication between cores performed via MPI. SGD, conjugate gradients, and L-BFGS are run as implemented in the Torch optim package on NVIDIA Tesla K40 GPUs."
Software Dependencies | No | "The new ADMM approach was implemented in Python... SGD, conjugate gradients, and L-BFGS are run as implemented in the Torch optim package." The paper names the software but does not give version numbers.
Experiment Setup | Yes | "We choose γi = 10 and βi = 1 for all trial runs reported here... We use training data with binary class labels... We use a separable loss function with a hinge penalty..." (a sketch of the hinge-penalty output update with these settings follows this table)
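
The paper's Algorithm 1 alternates closed-form updates over the weights W_l, activations a_l, pre-activations z_l, and a Lagrange multiplier attached to the output layer. Below is a minimal single-machine NumPy sketch of one such sweep for a one-hidden-layer ReLU network; the toy data, layer sizes, squared output penalty, and variable names are illustrative assumptions (the paper's experiments use a hinge output penalty, sketched further below, and a distributed MPI implementation).

```python
# Minimal sketch of the alternating updates in Algorithm 1 (ADMM for neural
# nets). Toy data, layer sizes, the squared output penalty, and all names are
# assumptions; the paper's experiments use a hinge output penalty and MPI.
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: N samples stored as columns, d0 input features, +/-1 labels.
N, d0, d1, d2 = 200, 20, 50, 1
a0 = rng.standard_normal((d0, N))            # input activations
y = np.sign(rng.standard_normal((1, N)))     # +/-1 labels

relu = lambda z: np.maximum(z, 0.0)

# Penalty weights (the paper reports gamma_i = 10 and beta_i = 1 throughout).
gamma, beta = 10.0, 1.0

# Initialise weights, pre-activations z_l, activations a_l, and the multiplier.
W1 = rng.standard_normal((d1, d0)) * 0.1
W2 = rng.standard_normal((d2, d1)) * 0.1
z1 = W1 @ a0
a1 = relu(z1)
z2 = W2 @ a1
lam = np.zeros_like(z2)                      # Lagrange multiplier on z2

def hidden_z_update(a, m, gamma, beta):
    """Entrywise argmin_z  gamma*(a - relu(z))**2 + beta*(z - m)**2."""
    f = lambda z: gamma * (a - relu(z)) ** 2 + beta * (z - m) ** 2
    # Candidate on the z >= 0 branch, where relu(z) = z.
    z_pos = np.maximum((gamma * a + beta * m) / (gamma + beta), 0.0)
    # Candidate on the z <= 0 branch, where relu(z) = 0.
    z_neg = np.minimum(m, 0.0)
    return np.where(f(z_pos) <= f(z_neg), z_pos, z_neg)

for it in range(50):
    # --- Hidden layer ---
    W1 = z1 @ np.linalg.pinv(a0)             # least-squares weight update
    # a1 couples the layer-1 activation penalty with the layer-2 weight penalty.
    A = beta * W2.T @ W2 + gamma * np.eye(d1)
    a1 = np.linalg.solve(A, beta * W2.T @ z2 + gamma * relu(z1))
    z1 = hidden_z_update(a1, W1 @ a0, gamma, beta)

    # --- Output layer ---
    W2 = z2 @ np.linalg.pinv(a1)
    m2 = W2 @ a1
    # Simplified output update: argmin_z (z - y)**2 + lam*z + beta*(z - m2)**2.
    z2 = (2 * y - lam + 2 * beta * m2) / (2 + 2 * beta)
    # Dual ascent step on the Lagrange multiplier.
    lam = lam + beta * (z2 - m2)

print("train accuracy:", np.mean(np.sign(W2 @ relu(W1 @ a0)) == y))
```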
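
For the Higgs split, the quoted numbers (10,500,000 training rows, 500,000 test rows, 28 features) match the UCI HIGGS release, where the last 500,000 rows form the test set. The sketch below reproduces that split under those assumptions; the file name, the use of pandas, and the {0, 1} to {-1, +1} label mapping are not taken from the paper.

```python
# Hedged loading/split sketch for the UCI HIGGS release (11,000,000 rows;
# column 0 is the 0/1 label, columns 1..28 are the features). File name,
# pandas usage, and the label mapping are assumptions, not the authors' code.
import pandas as pd

df = pd.read_csv("HIGGS.csv.gz", header=None)

# First 10,500,000 rows for training, last 500,000 for testing (as reported).
train, test = df.iloc[:10_500_000], df.iloc[10_500_000:]
X_train, y_train = train.iloc[:, 1:].to_numpy(), train.iloc[:, 0].to_numpy()
X_test, y_test = test.iloc[:, 1:].to_numpy(), test.iloc[:, 0].to_numpy()

# The experiments use binary class labels with a hinge penalty, so map
# {0, 1} -> {-1, +1} (mapping chosen here for illustration).
y_train, y_test = 2 * y_train - 1, 2 * y_test - 1

print(X_train.shape, X_test.shape)   # (10500000, 28), (500000, 28)
```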
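
The experiment setup row quotes γi = 10, βi = 1, binary class labels, and a separable hinge output penalty. Below is a hedged sketch of the resulting entrywise output-layer z update, including the Lagrange-multiplier term; the function name and the standalone framing are assumptions, since in Algorithm 1 this solve happens inside the ADMM sweep shown above.

```python
# Hedged sketch of the output-layer z update under a hinge penalty, with the
# reported gamma_i = 10, beta_i = 1 settings recorded as constants.
import numpy as np

GAMMA = 10.0  # gamma_i from the paper; weights the hidden activation penalties
BETA = 1.0    # beta_i from the paper; weights the pre-activation penalties

def hinge_output_update(y, m, lam, beta=BETA):
    """Entrywise argmin_z  max(0, 1 - y*z) + lam*z + beta*(z - m)**2,
    for labels y in {-1, +1}, where m = W_L a_{L-1}."""
    f = lambda z: np.maximum(0.0, 1.0 - y * z) + lam * z + beta * (z - m) ** 2
    # Branch with the hinge inactive (y*z >= 1): quadratic minimiser,
    # clipped onto the branch.
    z_flat = m - lam / (2.0 * beta)
    z_flat = np.where(y > 0, np.maximum(z_flat, 1.0), np.minimum(z_flat, -1.0))
    # Branch with the hinge active (y*z <= 1): minimiser of
    # 1 - y*z + lam*z + beta*(z - m)**2, clipped onto the branch.
    z_act = m + (y - lam) / (2.0 * beta)
    z_act = np.where(y > 0, np.minimum(z_act, 1.0), np.maximum(z_act, -1.0))
    # The global minimiser is whichever branch candidate scores lower.
    return np.where(f(z_flat) <= f(z_act), z_flat, z_act)

# Example: labels, current linear outputs m = W_L a_{L-1}, and multipliers.
y = np.array([1.0, -1.0, 1.0, -1.0])
m = np.array([0.2, 0.5, 2.0, -3.0])
lam = np.zeros(4)
print(hinge_output_update(y, m, lam))
```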