Continuous vs. Discrete Optimization of Deep Neural Networks

Authors: Omer Elkabetz, Nadav Cohen

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We corroborate our theoretical analysis through experiments with basic deep learning settings, which demonstrate that reducing the step size of gradient descent often leads to only slight changes in its trajectory. This confirms that, in basic settings, central aspects of deep neural network optimization may indeed be captured by gradient flow. Our experimental protocol is simple: on several deep neural networks classifying MNIST handwritten digits ([35]), we compare runs of gradient descent differing only in the step size η. Specifically, with η0=0.001 (standard choice of step size) and r ranging over {2, 5, 10, 20}, we compare, in terms of training loss value and location in weight space, every iteration of a run using η=η0 to every r-th iteration of a run in which η=η0/r. Figure 1 reports the results obtained on fully connected neural networks (as analyzed in Subsection 4.1), with both linear and non-linear activation. (A sketch of this comparison protocol appears after the table.)
Researcher Affiliation | Academia | Omer Elkabetz, Tel Aviv University, omer.elkabetz@cs.tau.ac.il; Nadav Cohen, Tel Aviv University, cohennadav@cs.tau.ac.il
Pseudocode | Yes | Procedure 1 (random balanced initialization). With a distribution P over d_n-by-d_0 matrices of rank at most min{d_0, d_1, ..., d_n}, initialize W_j ∈ R^{d_j, d_{j-1}}, j = 1, 2, ..., n, via the following steps: (i) sample A ∼ P; (ii) take a singular value decomposition A = UΣV^⊤, where U ∈ R^{d_n, min{d_0,d_n}} and V ∈ R^{d_0, min{d_0,d_n}} have orthonormal columns, and Σ ∈ R^{min{d_0,d_n}, min{d_0,d_n}} is diagonal and holds the singular values of A; and (iii) set W_n ≃ UΣ^{1/n}, W_{n-1} ≃ Σ^{1/n}, W_{n-2} ≃ Σ^{1/n}, ..., W_2 ≃ Σ^{1/n}, W_1 ≃ Σ^{1/n}V^⊤, where ≃ stands for equality up to zero-valued padding. (A PyTorch sketch of this procedure appears after the table.)
Open Source Code | Yes | "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]" "There are no new assets other than our code (included in supplemental material)."
Open Datasets | Yes | Our experimental protocol is simple: on several deep neural networks classifying MNIST handwritten digits ([35])... [35] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
Dataset Splits | No | The paper does not explicitly state training, validation, or test dataset splits (e.g., percentages or sample counts). It mentions using MNIST and focuses on training loss.
Hardware Specification | No | The paper states "Our models were implemented in PyTorch [44] and trained on commodity GPUs." This description is too general and does not provide specific GPU models or other hardware details.
Software Dependencies | No | The paper mentions "Our models were implemented in PyTorch [44]", but it does not specify a version number for PyTorch or any other software dependency.
Experiment Setup | Yes | Networks had depth n=3, input dimension d0=784 (corresponding to 28×28=784 pixels), hidden widths d1=d2=50, and output dimension d3=10 (corresponding to ten possible labels). Training was based on gradient descent applied to cross-entropy loss with no regularization, starting from a near-zero point drawn from Xavier distribution (cf. [24]). Specifically, with η0=0.001 (standard choice of step size) and r ranging over {2, 5, 10, 20}, we compared, in terms of training loss value and location in weight space, every iteration of a run using η=η0 to every r-th iteration of a run in which η=η0/r. (A sketch of this setup appears after the table.)
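
Below is a minimal sketch of the training setup described in the Experiment Setup row: a depth-3 fully connected network with dimensions 784-50-50-10, Xavier initialization, cross-entropy loss, and plain full-batch gradient descent at a fixed step size. It is not the authors' released code; the activation function, the omission of biases, the data handling, and the helper names (build_network, train) are assumptions made for illustration.

```python
# Minimal sketch, NOT the authors' code: depth-3 fully connected network
# (784 -> 50 -> 50 -> 10) trained on MNIST with cross-entropy loss and plain
# full-batch gradient descent at a fixed step size eta.
import torch
import torch.nn as nn
from torchvision import datasets

def build_network(dims=(784, 50, 50, 10), linear=False):
    # Whether biases are used and which non-linearity is applied are not stated
    # in this excerpt; biases are omitted and ReLU is used here as assumptions.
    layers = []
    for j in range(len(dims) - 1):
        layers.append(nn.Linear(dims[j], dims[j + 1], bias=False))
        if j < len(dims) - 2 and not linear:
            layers.append(nn.ReLU())
    net = nn.Sequential(*layers)
    for m in net:
        if isinstance(m, nn.Linear):
            nn.init.xavier_normal_(m.weight)  # near-zero Xavier initialization (cf. [24])
    return net

def train(net, x, y, eta=1e-3, num_steps=1000):
    # Full-batch gradient descent, no regularization; records the training-loss
    # value and the flattened weight vector at every iterate.
    loss_fn = nn.CrossEntropyLoss()
    losses, weights = [], []
    for _ in range(num_steps):
        net.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        losses.append(loss.item())
        weights.append(torch.cat([p.detach().flatten() for p in net.parameters()]))
        with torch.no_grad():
            for p in net.parameters():
                p -= eta * p.grad
    return losses, weights

# Hypothetical data handling: flatten 28x28 MNIST images to 784-dimensional inputs.
mnist = datasets.MNIST(root="data", train=True, download=True)
x = mnist.data.float().view(-1, 784) / 255.0
y = mnist.targets
```

The loss value and the flattened weight vector are recorded at every iterate so that two runs can later be compared point by point.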
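
Next, a minimal sketch of the step-size comparison protocol quoted in the Research Type row: iterate t of the run with step size η0 is matched against iterate r·t of the run with step size η0/r, and the two are compared by training-loss value and by Euclidean distance in weight space. Both runs must share the exact same initialization; compare_trajectories and the trajectory format are hypothetical helpers, not taken from the paper's code.

```python
# Minimal sketch of the comparison protocol (the exact implementation is an
# assumption): iterate t of the eta0 run is compared to iterate r*t of the
# eta0/r run, in loss value and in Euclidean distance in weight space.
import torch

def compare_trajectories(losses_base, weights_base, losses_fine, weights_fine, r):
    # losses_*: lists of scalar training-loss values per iterate.
    # weights_*: lists of flattened weight vectors (torch.Tensor) per iterate.
    num_common = min(len(losses_base), len(losses_fine) // r)
    gaps = []
    for t in range(num_common):
        loss_gap = abs(losses_base[t] - losses_fine[r * t])
        weight_dist = torch.norm(weights_base[t] - weights_fine[r * t]).item()
        gaps.append((t, loss_gap, weight_dist))
    return gaps

# Usage with the hypothetical helpers above, for eta0 = 0.001 and r = 5:
#   net_a = build_network(); net_b = build_network()
#   net_b.load_state_dict(net_a.state_dict())        # identical initialization
#   l0, w0 = train(net_a, x, y, eta=1e-3, num_steps=1000)
#   lr, wr = train(net_b, x, y, eta=1e-3 / 5, num_steps=5000)
#   results = compare_trajectories(l0, w0, lr, wr, r=5)
```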
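
Finally, a minimal PyTorch sketch of Procedure 1 (random balanced initialization) as quoted in the Pseudocode row. The choice of the distribution P (a scaled Gaussian here) and the scale parameter are assumptions; the procedure only requires that P be supported on d_n-by-d_0 matrices of rank at most min{d_0, d_1, ..., d_n}.

```python
# Minimal sketch of Procedure 1 (random balanced initialization).
# The distribution P is taken to be a scaled Gaussian, which is an assumption.
import torch

def balanced_init(dims, scale=1e-3):
    # dims = [d0, d1, ..., dn]; returns [W1, ..., Wn] with Wj of shape (dj, d_{j-1}).
    d0, dn, n = dims[0], dims[-1], len(dims) - 1
    k = min(d0, dn)
    assert n >= 2, "sketch assumes a deep network (n >= 2)"
    assert k <= min(dims), "sampled A must have rank at most min{d0, ..., dn}"
    # (i) sample A ~ P, a dn-by-d0 matrix.
    A = scale * torch.randn(dn, d0)
    # (ii) thin SVD: A = U @ diag(S) @ Vh, with U (dn x k) and Vh (k x d0).
    U, S, Vh = torch.linalg.svd(A, full_matrices=False)
    S_root = torch.diag(S ** (1.0 / n))  # Sigma^(1/n)
    # (iii) Wn ~ U Sigma^(1/n), W_{n-1} ~ ... ~ W2 ~ Sigma^(1/n), W1 ~ Sigma^(1/n) Vh,
    #       where "~" is equality up to zero-valued padding to the target shape.
    weights = []
    for j in range(1, n + 1):
        if j == 1:
            core = S_root @ Vh
        elif j == n:
            core = U @ S_root
        else:
            core = S_root
        W = torch.zeros(dims[j], dims[j - 1])
        W[: core.shape[0], : core.shape[1]] = core
        weights.append(W)
    return weights

# Example: a depth-3 network with dims (784, 50, 50, 10) as in the experiments.
# W1, W2, W3 = balanced_init([784, 50, 50, 10])
```

One can verify that the resulting factors satisfy W_{j+1}^⊤ W_{j+1} = W_j W_j^⊤ for every j, which is the balancedness the procedure's name refers to.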