Implicit regularization of deep residual networks towards neural ODEs
Authors: Pierre Marion, Yu-Han Wu, Michael Eli Sander, Gérard Biau
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 NUMERICAL EXPERIMENTS We now present numerical experiments to validate our theoretical findings, using both synthetic and real-world data. Our code is available on GitHub (see Appendix E for details and additional plot). 5.1 SYNTHETIC DATA We consider the residual network (3) with the initialization scheme of Section 3. The activation function is GELU (Hendrycks & Gimpel, 2016), which is a smooth approximation of ReLU: x ↦ max(x, 0). The sample points (x_i, y_i)_{1 ≤ i ≤ n} follow independent standard Gaussian distributions. The mean-squared error is minimized using full-batch gradient descent. The following experiments exemplify the large-depth (t ∈ [0, T], L → ∞) and long-time (t → ∞, L finite) limits. ... 5.2 REAL-WORLD DATA We now investigate the properties of deep residual networks on the CIFAR-10 dataset (Krizhevsky, 2009). ... Table 1 reports the accuracy of the trained network, and whether it has Lipschitz continuous (or smooth) weights after training, depending on the activation function σ and on the initialization scheme. |
| Researcher Affiliation | Academia | Pierre Marion, Yu-Han Wu (LPSM, Sorbonne Université, CNRS, Paris, France); Michael E. Sander (DMA, ENS, CNRS, Paris, France); Gérard Biau (LPSM, Sorbonne Université, CNRS, Paris, France) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available on GitHub (see Appendix E for details and additional plot). |
| Open Datasets | Yes | 5.2 REAL-WORLD DATA We now investigate the properties of deep residual networks on the CIFAR-10 dataset (Krizhevsky, 2009). |
| Dataset Splits | No | The paper mentions using the CIFAR-10 dataset and training for a certain number of iterations/epochs, but it does not provide specific details on how the dataset was split into training, validation, or test sets (e.g., percentages, sample counts, or references to predefined splits). |
| Hardware Specification | No | This work was granted access to the HPC resources of IDRIS under the allocation 2020-[AD011012073] made by GENCI. This mentions an HPC resource but does not provide specific hardware details (e.g., GPU/CPU models, memory sizes). |
| Software Dependencies | No | We use PyTorch (Paszke et al., 2019). This mentions a software dependency but does not specify its version number or any other software dependencies with versions. |
| Experiment Setup | Yes | 5.1 SYNTHETIC DATA ... Large-depth limit. We take n = 100, d = 16, m = 32. We train for 500 iterations, and set the learning rate to L × 10⁻². ... Long-time limit. We take n = 50, d = 16, m = 64, L = 64, and train for 80,000 iterations with a learning rate of 5L × 10⁻³. 5.2 REAL-WORLD DATA ... The model is trained using stochastic gradient descent on the cross-entropy loss for 180 epochs. The initial learning rate is 4 × 10⁻² and is gradually decreased using a cosine learning rate scheduler. (An illustrative sketch of the synthetic-data setup appears below the table.) |
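
For readers who want to see how the quoted synthetic-data setup (Research Type and Experiment Setup rows) fits together, here is a minimal PyTorch sketch. Only the ingredients reported above are taken from the paper: i.i.d. standard Gaussian samples, n = 100, d = 16, m = 32, a GELU residual network trained by full-batch gradient descent on the mean-squared error for 500 iterations, and the depth-scaled learning rate L × 10⁻². The residual update h_{k+1} = h_k + (1/L) W_{k+1} GELU(h_k), the input/output projections, the i.i.d. Gaussian weight initialization, the target dimension, and the depth L = 64 are illustrative assumptions, not details confirmed by the paper (its exact parameterization and initialization are given in its Sections 3 and 5).

```python
# Hedged sketch of the Section 5.1 synthetic-data experiment, under the assumptions
# stated above. Only n, d, m, GELU, the MSE loss, full-batch gradient descent,
# 500 iterations, and the L * 1e-2 learning rate come from the reported setup;
# everything else (model form, initialization, depth, target dimension) is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

n, d, m, L = 100, 16, 32, 64      # L = 64 is an assumed depth, not reported for this experiment
x = torch.randn(n, d)             # i.i.d. standard Gaussian inputs
y = torch.randn(n, d)             # i.i.d. standard Gaussian targets (output dimension assumed = d)


class ResNet(nn.Module):
    """Residual network with 1/L-scaled updates (assumed form of the paper's model (3))."""

    def __init__(self, d: int, m: int, L: int) -> None:
        super().__init__()
        self.L = L
        self.embed = nn.Linear(d, m, bias=False)     # input projection
        self.readout = nn.Linear(m, d, bias=False)   # output projection
        # Stand-in for the initialization scheme of Section 3: i.i.d. Gaussian weights.
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(m, m) / m**0.5) for _ in range(L)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.embed(x)
        for W in self.weights:
            h = h + F.gelu(h) @ W.T / self.L         # h_{k+1} = h_k + (1/L) W_{k+1} GELU(h_k)
        return self.readout(h)


model = ResNet(d, m, L)
optimizer = torch.optim.SGD(model.parameters(), lr=L * 1e-2)   # depth-scaled learning rate

for step in range(500):                                        # full-batch gradient descent on MSE
    optimizer.zero_grad()
    loss = F.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"step {step:3d}  mse {loss.item():.4f}")
```

The CIFAR-10 experiment quoted in the same rows follows the same pattern but, per the report, swaps in stochastic gradient descent on the cross-entropy loss, an initial learning rate of 4 × 10⁻², a cosine schedule, and 180 epochs.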