Continuous vs. Discrete Optimization of Deep Neural Networks
Authors: Omer Elkabetz, Nadav Cohen
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We corroborate our theoretical analysis through experiments with basic deep learning settings, which demonstrate that reducing the step size of gradient descent often leads to only slight changes in its trajectory. This confirms that, in basic settings, central aspects of deep neural network optimization may indeed be captured by gradient flow. Our experimental protocol is simple: on several deep neural networks classifying MNIST handwritten digits ([35]), we compare runs of gradient descent differing only in the step size η. Specifically, with η0 = 0.001 (standard choice of step size) and r ranging over {2, 5, 10, 20}, we compare, in terms of training loss value and location in weight space, every iteration of a run using η = η0 to every r-th iteration of a run in which η = η0/r. Figure 1 reports the results obtained on fully connected neural networks (as analyzed in Subsection 4.1), with both linear and non-linear activation. (A code sketch of this comparison protocol appears after the table.) |
| Researcher Affiliation | Academia | Omer Elkabetz, Tel Aviv University, omer.elkabetz@cs.tau.ac.il; Nadav Cohen, Tel Aviv University, cohennadav@cs.tau.ac.il |
| Pseudocode | Yes | Procedure 1 (random balanced initialization). With a distribution P over d_n-by-d_0 matrices of rank at most min{d_0, d_1, ..., d_n}, initialize W_j ∈ R^{d_j × d_{j−1}}, j = 1, 2, ..., n, via the following steps: (i) sample A ∼ P; (ii) take a singular value decomposition A = UΣV^⊤, where U ∈ R^{d_n × min{d_0,d_n}} and V ∈ R^{d_0 × min{d_0,d_n}} have orthonormal columns, and Σ ∈ R^{min{d_0,d_n} × min{d_0,d_n}} is diagonal and holds the singular values of A; and (iii) set W_n ≐ UΣ^{1/n}, W_{n−1} ≐ Σ^{1/n}, W_{n−2} ≐ Σ^{1/n}, ..., W_2 ≐ Σ^{1/n}, W_1 ≐ Σ^{1/n}V^⊤, where ≐ stands for equality up to zero-valued padding. (A NumPy sketch of this procedure appears after the table.) |
| Open Source Code | Yes | Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] There are no new assets other than our code (included in supplemental material). |
| Open Datasets | Yes | Our experimental protocol is simple: on several deep neural networks classifying MNIST handwritten digits ([35])... [35] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998. |
| Dataset Splits | No | The paper does not explicitly state training, validation, or test dataset splits (e.g., percentages or sample counts). It mentions using MNIST and focuses on training loss. |
| Hardware Specification | No | The paper states 'Our models were implemented in PyTorch [44] and trained on commodity GPUs.' This description is too general and does not provide specific GPU models or other hardware details. |
| Software Dependencies | No | The paper mentions 'Our models were implemented in PyTorch [44]', but it does not specify a version number for PyTorch or any other software dependency. |
| Experiment Setup | Yes | Networks had depth n = 3, input dimension d0 = 784 (corresponding to 28×28 = 784 pixels), hidden widths d1 = d2 = 50 and output dimension d3 = 10 (corresponding to ten possible labels). Training was based on gradient descent applied to cross-entropy loss with no regularization, starting from a near-zero point drawn from the Xavier distribution (cf. [24]). Specifically, with η0 = 0.001 (standard choice of step size) and r ranging over {2, 5, 10, 20}, we compared, in terms of training loss value and location in weight space, every iteration of a run using η = η0 to every r-th iteration of a run in which η = η0/r. (A PyTorch sketch of this setup appears after the table.) |
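
The Pseudocode row quotes Procedure 1 (random balanced initialization). Below is a minimal NumPy sketch of its three steps; the function name `random_balanced_init`, the `sample_A` callable, and the crop-or-pad helper `fit` are illustrative conventions, not code from the paper.

```python
import numpy as np

def random_balanced_init(dims, sample_A):
    """Sketch of Procedure 1 (random balanced initialization).

    dims     : layer dimensions [d0, d1, ..., dn]
    sample_A : callable drawing a dn-by-d0 matrix of rank at most min(dims) from P
    Returns [W1, ..., Wn], with W_j of shape (d_j, d_{j-1}).
    """
    n = len(dims) - 1
    assert n >= 2, "sketch assumes depth >= 2; for n = 1 simply set W1 = A"
    A = sample_A()                                    # step (i): A ~ P
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # step (ii): A = U Sigma V^T
    S_root = np.diag(s ** (1.0 / n))                  # Sigma^{1/n}

    def fit(M, rows, cols):
        # "equality up to zero-valued padding": embed (or crop) M in a rows-by-cols
        # zero matrix; cropping discards only zero singular values since rank(A) <= min(dims)
        out = np.zeros((rows, cols))
        r, c = min(rows, M.shape[0]), min(cols, M.shape[1])
        out[:r, :c] = M[:r, :c]
        return out

    # step (iii): W_n = U Sigma^{1/n}, W_{n-1} = ... = W_2 = Sigma^{1/n}, W_1 = Sigma^{1/n} V^T
    Ws = [fit(S_root @ Vt, dims[1], dims[0])]
    Ws += [fit(S_root, dims[j], dims[j - 1]) for j in range(2, n)]
    Ws += [fit(U @ S_root, dims[n], dims[n - 1])]
    return Ws
```

For example, `random_balanced_init([784, 50, 50, 10], lambda: 0.01 * np.random.randn(10, 784))` returns three matrices whose product W3 @ W2 @ W1 reconstructs the sampled A.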
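
The Experiment Setup row describes fully connected networks of depth 3 with dimensions 784 → 50 → 50 → 10, trained with cross-entropy loss and no regularization from a near-zero Xavier initialization. A minimal PyTorch sketch follows; the choice of ReLU for the non-linear variant, the omission of biases, and the `init_gain` scaling used to approximate the "near-zero" draw are assumptions not stated in the quoted text.

```python
import torch.nn as nn

def build_fc_net(nonlinear=True, init_gain=0.01):
    """Fully connected network from the setup: depth n = 3, dimensions 784 -> 50 -> 50 -> 10.
    `init_gain` is an assumed scaling for the "near-zero" Xavier initialization."""
    dims = [784, 50, 50, 10]
    layers = []
    for i in range(len(dims) - 1):
        linear = nn.Linear(dims[i], dims[i + 1], bias=False)        # biases assumed absent
        nn.init.xavier_normal_(linear.weight, gain=init_gain)       # scaled Xavier draw
        layers.append(linear)
        if nonlinear and i < len(dims) - 2:
            layers.append(nn.ReLU())  # drop for the linear-activation variant
    return nn.Sequential(*layers)

loss_fn = nn.CrossEntropyLoss()  # cross-entropy training loss, no regularization
```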
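
The comparison protocol from the Research Type and Experiment Setup rows amounts to two gradient descent runs from a shared initialization, aligning iteration t of the η = η0 run with iteration r·t of the η = η0/r run. The sketch below is a hedged illustration (the names `run_gd` and `compare_step_sizes` are ours, and full-batch training is assumed), not the authors' released code.

```python
import copy
import torch

def run_gd(model, inputs, targets, loss_fn, eta, steps, record_every):
    """One full-batch gradient descent run; records (loss, flattened weights)
    every `record_every` iterations, starting from iteration 0."""
    net = copy.deepcopy(model)
    records = []
    for t in range(steps + 1):
        loss = loss_fn(net(inputs), targets)
        if t % record_every == 0:
            flat = torch.cat([p.detach().flatten() for p in net.parameters()])
            records.append((loss.item(), flat))
        net.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in net.parameters():
                p -= eta * p.grad  # plain gradient descent step
    return records

def compare_step_sizes(model, inputs, targets, loss_fn, eta0=1e-3, r=5, iters=1000):
    """Align iteration t of the eta0 run with iteration r*t of the eta0/r run and
    report the gap in training loss and the Euclidean distance in weight space."""
    coarse = run_gd(model, inputs, targets, loss_fn, eta0, iters, record_every=1)
    fine = run_gd(model, inputs, targets, loss_fn, eta0 / r, iters * r, record_every=r)
    return [(t,
             abs(coarse[t][0] - fine[t][0]),                 # training-loss gap
             torch.norm(coarse[t][1] - fine[t][1]).item())   # weight-space distance
            for t in range(iters + 1)]
```

Passing in the `build_fc_net()` model from the previous sketch together with flattened MNIST images and labels, and plotting the returned loss gaps and weight-space distances against t, would mirror the kind of comparison Figure 1 reports.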