Better Training using Weight-Constrained Stochastic Dynamics
Authors: Benedict Leimkuhler, Tiffany J Vlaar, Timothée Pouchon, Amos Storkey
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We study stochastic training methods based on Langevin dynamics combined with algebraic constraints. Our general framework allows for incorporating constraints into standard training schemes and sampling methods for neural networks. Constraints provide direct control of the parameter space of a model and hence afford a means to improve its generalization performance. As applications, we consider magnitude control and orthogonality of neural network weights. (...) We illustrate that the use of a penalty-based soft constraint introduces an undesirable stiffness into the system, requiring the stepsize to be lowered to improve performance and to allow for the use of larger penalty strengths. The soft constraint approach is unable to reach the same performance as our o-CoLA-od method (right-most) and its performance is heavily dependent on the choice of penalty strength and step size. (A minimal soft-penalty sketch follows the table below.) |
| Researcher Affiliation | Academia | 1 Department of Mathematics, University of Edinburgh, United Kingdom; 2 Department of Informatics, University of Edinburgh, United Kingdom. Correspondence to: Tiffany Vlaar <Tiffany.Vlaar@ed.ac.uk>. |
| Pseudocode | Yes | Algorithm 1: Orthogonality-constrained overdamped Langevin; Algorithm 2: Orthogonality-constrained underdamped Langevin |
| Open Source Code | Yes | We provide PyTorch code to support our algorithms, which can be found on https://github.com/TiffanyVlaar/ConstrainedNNtraining |
| Open Datasets | Yes | For a ResNet-34 architecture with BatchNorm and learning rate (LR) decay on CIFAR-10 (Krizhevsky & Hinton, 2009) data (...) We evaluate our circle constrained c-CoLA-ud method on the Fashion-MNIST dataset (Xiao et al., 2017). (...) Table 2. Minimum val. loss on Penn Treebank data (batchsize 1024) (Marcus et al., 1993) and Wikitext-2 (batchsize 128) (Merity et al., 2017) |
| Dataset Splits | No | We reduce the amount of training data to 10K samples and use the remaining 60K samples as test data. (...) Minimum val. loss on Penn Treebank data (batchsize 1024) (Marcus et al., 1993) and Wikitext-2 (batchsize 128) (Merity et al., 2017). The paper mentions the use of validation data for Penn Treebank and Wikitext-2 by reporting 'Minimum val. loss', but does not specify the exact split percentages or counts for these datasets, nor for Fashion-MNIST. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory, or cloud instance types used for the experiments. |
| Software Dependencies | No | The paper mentions 'PyTorch code' but does not specify a version number for PyTorch or any other software dependencies with their versions. |
| Experiment Setup | Yes | Hyperparameter settings: all: h = 0.05, 2% subsampling; SGD with WD = 1e-4; C-SGD: r_0 = 1, r_1 = 5 (see Eq. (2)); SGLD and C-SGLD: = 5e-5 (see Eq. (7)). (...) We set stepsize h = 0.1 for all methods and use 5% subsampling. We found the optimal penalty strength λ = 0.05 for the orthogonal regularization method through line search. (...) For SGD we initially use h = 0.1 and decay by a factor 10 every 50 epochs (...). We set momentum = 0.9 (...) Hyperparameters c-CoLA-ud: h = 0.3, γ = 1, r_0 = 0.05, r_1 = 0.1, = 0. (...) Hyperparameters c-CoLA-ud: h = 0.4, r = 0.5, r_L = 0.1, r_N = 1, r_A = 1, = 0, γ = 0.5 (Treebank) and γ = 1 (Wikitext-2), where the subscripts L, N, A represent the radii belonging to the linear, norm and self-attention layers respectively. (A simplified projection-based sketch of a norm-constrained update follows the table.) |
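
To make the soft-constraint comparison quoted above concrete, here is a minimal, hedged sketch of a penalty-based orthogonality regularizer of the form λ‖WᵀW − I‖²_F added to the training loss, using the penalty strength λ = 0.05 quoted in the setup row. The model, loss, data, and optimizer settings are placeholder choices for illustration; this is not the authors' implementation.

```python
# Hedged sketch (not the paper's code): the "soft" orthogonality constraint
# applied as an additive penalty lambda * ||W^T W - I||_F^2 on a weight matrix.
import torch
import torch.nn as nn

def orthogonality_penalty(weight: torch.Tensor) -> torch.Tensor:
    """Frobenius-norm penalty ||W^T W - I||_F^2 for a 2D weight matrix."""
    gram = weight.t() @ weight
    eye = torch.eye(gram.shape[0], device=weight.device)
    return ((gram - eye) ** 2).sum()

model = nn.Linear(20, 10)          # placeholder model
criterion = nn.MSELoss()           # placeholder loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
lam = 0.05                         # penalty strength quoted in the setup row

x, y = torch.randn(32, 20), torch.randn(32, 10)   # toy data
for _ in range(10):
    optimizer.zero_grad()
    loss = criterion(model(x), y) + lam * orthogonality_penalty(model.weight)
    loss.backward()
    optimizer.step()
```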
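By contrast, a hard "circle" (fixed-norm) constraint can be illustrated, in simplified form, by projecting the weights back onto a sphere of fixed radius after each noisy gradient step. This is only a sketch: the paper's c-CoLA/o-CoLA methods use constrained Langevin integrators rather than naive projection, and the stepsize, noise scale, and radius below are illustrative values loosely based on the quoted hyperparameters.

```python
# Hedged sketch (not the paper's algorithm): an SGLD-style step followed by
# projection onto the sphere ||w|| = r, as a stand-in for a hard norm constraint.
import torch

def constrained_sgld_step(w, grad, h=0.05, sigma=5e-5, r=1.0):
    """One noisy gradient step, then re-projection onto the radius-r sphere."""
    noise = sigma * (2.0 * h) ** 0.5 * torch.randn_like(w)
    w = w - h * grad + noise       # unconstrained gradient + noise step
    return r * w / w.norm()        # project back onto ||w|| = r

# toy usage: keep a single parameter vector on the unit sphere
w = torch.randn(10)
w = w / w.norm()
for _ in range(100):
    grad = 2 * w                   # gradient of the toy loss ||w||^2
    w = constrained_sgld_step(w, grad)
print(float(w.norm()))             # ~1.0 up to floating-point error
```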