Beyond Backprop: Online Alternating Minimization with Auxiliary Variables
Authors: Anna Choromanska, Benjamin Cowen, Sadhana Kumaravel, Ronny Luss, Mattia Rigotti, Irina Rish, Paolo Diachille, Viatcheslav Gurev, Brian Kingsbury, Ravi Tejwani, Djallel Bouneffouf
ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | extensive empirical evaluation on a variety of network architectures and datasets, demonstrating significant advantages of our method vs. offline counterparts, as well as somewhat faster initial convergence as compared to SGD and Adam, followed by similar asymptotic performance; our online method inherits common advantages of similar offline auxiliary-variable methods, including (1) no vanishing gradients, (2) handling of non-differentiable nonlinearities more easily in local subproblems, and (3) the possibility for parallelizing weight updates across layers; similarly to target propagation approaches (LeCun, 1986; 1987; Lee et al., 2015; Bartunov et al., 2018), our method is based on an explicit propagation of neural activity and local synaptic updates, which is one step closer to a more biologically plausible credit assignment mechanism than backprop; see (Bartunov et al., 2018) for a detailed discussion on this topic. (Section 4, Experiments:) We compare on several datasets (MNIST, CIFAR10, HIGGS) our online alternating minimization algorithms, AM-mem and AM-Adam (using mini-batches instead of single samples at each time point), against backprop-based online methods, SGD and Adam (Kingma & Ba, 2014), as well as against the offline auxiliary-variable ADMM method of (Taylor et al., 2016), using code provided by the authors. (An illustrative sketch of this layer-wise alternating-minimization scheme appears after the table.) |
| Researcher Affiliation | Collaboration | 1 ECE, NYU Tandon; 2 IBM T.J. Watson Research Center; 3 MIT. Correspondence to: Irina Rish <IBM>. |
| Pseudocode | Yes | Our approach is outlined in Algorithms 2.1, 2.2, and 2.3, omitting implementation details such as the adaptive µ schedule, hyperparameters controlling the number of iterations in optimization subroutines, and several others; we will make our code available online. Algorithm 2.1: Online Alternating Minimization (AM); Algorithm 2.2: Activation Propagation (Code Update) Steps; Algorithm 2.3: Weight and Memory Update Steps. |
| Open Source Code | No | Our approach is outlined in Algorithms 2.1, 2.2, and 2.3, omitting implementation details such as the adaptive µ schedule, hyperparameters controlling the number of iterations in optimization subroutines, and several others; we will make our code available online. |
| Open Datasets | Yes | We experiment with fully-connected networks on the standard MNIST (LeCun, 1998) dataset, consisting of 28×28 grayscale images of hand-drawn digits, with 50K samples, and a test set of 10K samples. Figures 3 and 4 show similar results for the same experiment setting, on the CIFAR10 dataset (5000 training and 10000 test samples). HIGGS dataset, containing 10,500,000 training samples (28 features each) and 500,000 test samples. Next, we evaluate our method on Sequential MNIST (Le et al., 2015)... CNN: LeNet-5, MNIST. |
| Dataset Splits | Yes | Hyperparameters used for each method were optimized by grid search on a validation subset of training data. |
| Hardware Specification | No | The paper mentions training models but does not specify any hardware details (e.g., CPU, GPU models, or memory specifications) used for the experiments. |
| Software Dependencies | No | All our algorithms were implemented in PyTorch (Paszke et al., 2017); we also used the PyTorch implementations of SGD and Adam. (No version numbers provided for PyTorch or other libraries.) |
| Experiment Setup | Yes | Hyperparameters used for each method were optimized by grid search on a validation subset of training data. We evaluate two different 2-hidden-layer architectures, with equal hidden layer sizes of 100 and 500, and ReLU activations. For all online methods, we use minibatches of size 200, so one epoch over the 10.5M samples equals 52,500 iterations. (An illustrative sketch of this baseline setup also appears after the table.) |
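
The comparison above hinges on the paper's layer-wise alternating-minimization scheme, so a small illustration may help. Below is a minimal sketch, assuming a quadratic penalty coupling each auxiliary activation ("code") to the layer output that feeds it, of the general alternate-then-update pattern that Algorithms 2.1-2.3 follow. The penalty form, the fixed µ, the single gradient step per subproblem, and all names are illustrative assumptions, not the paper's exact objective (which also uses an adaptive µ schedule and a memory term in the online setting).

```python
# Minimal sketch of layer-wise alternating minimization with auxiliary
# activation variables ("codes") for a 2-hidden-layer ReLU network, in the
# spirit of Algorithms 2.1-2.3. ASSUMPTIONS: quadratic penalty, fixed mu,
# one gradient step per local subproblem, no memory term -- these are
# illustrative choices, not the paper's exact formulation.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_in, h, d_out, mu, lr = 784, 100, 10, 1.0, 0.1

# Layer weights (biases omitted for brevity).
W1 = (0.01 * torch.randn(h, d_in)).requires_grad_()
W2 = (0.01 * torch.randn(h, h)).requires_grad_()
W3 = (0.01 * torch.randn(d_out, h)).requires_grad_()


def penalty(code, prev, W):
    """Quadratic mismatch between a code and the layer output feeding it."""
    return mu * ((code - F.relu(prev @ W.t())) ** 2).mean()


def am_step(x, y):
    """One alternating-minimization step on a mini-batch (x, y)."""
    # 1) Activation propagation: initialize the codes with a forward pass.
    c1 = F.relu(x @ W1.t()).detach().requires_grad_()
    c2 = F.relu(c1 @ W2.t()).detach().requires_grad_()

    # 2) Code updates (backward sweep), weights held fixed:
    #    one gradient step per code on its local objective.
    loss_c2 = F.cross_entropy(c2 @ W3.t(), y) + penalty(c2, c1, W2)
    g2, = torch.autograd.grad(loss_c2, c2)
    c2 = (c2 - lr * g2).detach().requires_grad_()

    loss_c1 = penalty(c2, c1, W2) + penalty(c1, x, W1)
    g1, = torch.autograd.grad(loss_c1, c1)
    c1, c2 = (c1 - lr * g1).detach(), c2.detach()

    # 3) Weight updates: each layer solves an independent local subproblem,
    #    which is what makes per-layer parallelization possible.
    for W, local_loss in [
        (W1, lambda: penalty(c1, x, W1)),
        (W2, lambda: penalty(c2, c1, W2)),
        (W3, lambda: F.cross_entropy(c2 @ W3.t(), y)),
    ]:
        g, = torch.autograd.grad(local_loss(), W)
        with torch.no_grad():
            W -= lr * g


# Usage on a random mini-batch of 200 samples (a stand-in for MNIST).
x, y = torch.randn(200, d_in), torch.randint(0, d_out, (200,))
am_step(x, y)
```

Because each weight matrix appears only in its own local subproblem, the three weight updates in step 3 are independent of one another, which is the property behind the "parallelizing weight updates across layers" claim quoted above; no gradient is ever propagated through a long chain of layers, which is why vanishing gradients do not arise.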
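
The setup row mentions grid search over hyperparameters on a validation subset and mini-batches of size 200 for the backprop baselines (SGD and Adam in PyTorch). Below is a minimal sketch of that kind of baseline loop; the grid values, the 90/10 split, the single training pass, and the synthetic stand-in data are all illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of the backprop baseline setup: a 2-hidden-layer ReLU net
# trained with PyTorch's SGD (or Adam), learning rate chosen by grid search
# on a validation subset of the training data. ASSUMPTIONS: grid values,
# split sizes, one training pass, and synthetic data are illustrative only.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset, random_split


def make_net(d_in=784, hidden=100, d_out=10):
    return nn.Sequential(
        nn.Linear(d_in, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, d_out),
    )


def accuracy(net, loader):
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (net(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total


# Synthetic MNIST-sized stand-in; swap in torchvision.datasets.MNIST for real runs.
data = TensorDataset(torch.randn(10_000, 784), torch.randint(0, 10, (10_000,)))
train_set, val_set = random_split(data, [9_000, 1_000])
train_loader = DataLoader(train_set, batch_size=200, shuffle=True)
val_loader = DataLoader(val_set, batch_size=200)

best_lr, best_acc = None, -1.0
for lr in (1e-1, 1e-2, 1e-3):                       # grid search over the learning rate
    net = make_net()
    opt = torch.optim.SGD(net.parameters(), lr=lr)  # or torch.optim.Adam
    for x, y in train_loader:                       # one pass, kept short for brevity
        opt.zero_grad()
        nn.functional.cross_entropy(net(x), y).backward()
        opt.step()
    acc = accuracy(net, val_loader)
    if acc > best_acc:
        best_lr, best_acc = lr, acc
print("selected learning rate:", best_lr)
```

The same skeleton covers both evaluated widths (hidden sizes 100 and 500) by changing the `hidden` argument, and the quoted figure of 52,500 iterations per HIGGS epoch follows directly from 10.5M samples divided by the mini-batch size of 200.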