Towards a General Theory of Infinite-Width Limits of Neural Classifiers
Authors: Eugene Golikov
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We prove a convergence theorem for it, and show that it provides a more reasonable approximation for finite-width nets compared to the NTK limit if learning rates are not very small. Also, our framework suggests a limit model that coincides neither with the MF limit nor with the NTK one. We show that for networks with more than two hidden layers RMSProp training has a non-trivial discrete-time MF limit but GD training does not have one. Overall, our framework demonstrates that both MF and NTK limits have considerable limitations in approximating finite-sized neural nets, indicating the need for designing more accurate infinite-width approximations for them. ... Figure 1. MF, NTK and intermediate scalings result in non-trivial limit models for a single layer neural net. ... We train a 1-hidden layer net on a subset of CIFAR2 (a dataset of the first two classes of CIFAR10) of size 1000 with gradient descent. |
| Researcher Affiliation | Academia | 1Neural Networks and Deep Learning lab., Moscow Institute of Physics and Technology, Moscow, Russia. Correspondence to: Eugene A. Golikov <golikov.ea@mipt.ru>. |
| Pseudocode | No | No structured pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | No | The paper provides no concrete access to source code for the described methodology: no repository link, no explicit code-release statement, and no code in supplementary materials. |
| Open Datasets | Yes | We train a 1-hidden layer net on a subset of CIFAR2 (a dataset of the first two classes of CIFAR10) of size 1000 with gradient descent. |
| Dataset Splits | No | No specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning is provided. The paper only mentions using a 'subset of CIFAR2 (a dataset of the first two classes of CIFAR10) of size 1000'. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments are provided. |
| Software Dependencies | No | No specific ancillary software details (e.g., library or solver names with version numbers like Python 3.8, CPLEX 12.4) needed to replicate the experiment are provided. |
| Experiment Setup | Yes | Setup: We train a 1-hidden layer net on a subset of CIFAR2 (a dataset of the first two classes of CIFAR10) of size 1000 with gradient descent. We take a reference net of width d = 2^7 = 128 trained with unscaled reference learning rates η_a = η_w = 0.02 and scale its hyperparameters according to MF (blue curves), NTK (orange curves), and intermediate scaling with q_σ = 3/4 (green curves, see text). (from Figure 1 caption) and ...trained with (unscaled) reference learning rates η_a = η_w = 0.02 for GD and η_a = η_w = 0.0002 for RMSProp... (from Figure 2 caption). |
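
The reference configuration quoted in the Experiment Setup row is concrete enough to sketch. Below is a minimal PyTorch sketch of that reference run only, assuming full-batch gradient descent, a ReLU hidden layer, a logistic loss on ±1 labels, and 1/√fan-in initialization; none of these details beyond the dataset, width d = 2^7 = 128, and learning rates η_a = η_w = 0.02 are confirmed by the quoted captions. The width-dependent hyperparameter scalings for the MF, NTK, and intermediate (q_σ = 3/4) parameterizations are defined in the paper itself and are not reproduced here; all variable names are illustrative.

```python
# Hedged sketch of the reference setup: 1-hidden-layer net, width d = 2**7 = 128,
# trained with full-batch GD on CIFAR2 (first two classes of CIFAR-10), 1000 samples,
# reference learning rates eta_a = eta_w = 0.02. The MF / NTK / intermediate
# hyperparameter scalings studied in the paper are NOT applied here.
import torch
import torch.nn.functional as F
from torchvision import datasets, transforms

torch.manual_seed(0)

# CIFAR2: the first two classes of CIFAR-10, subset of size 1000.
cifar = datasets.CIFAR10(root="./data", train=True, download=True,
                         transform=transforms.ToTensor())
idx = [i for i, t in enumerate(cifar.targets) if t < 2][:1000]
X = torch.stack([cifar[i][0] for i in idx]).reshape(len(idx), -1)              # (1000, 3072)
y = torch.tensor([cifar.targets[i] for i in idx], dtype=torch.float32) * 2 - 1  # labels in {-1, +1}

# 1-hidden-layer net of width d = 128. The width-dependent normalization of the
# layers (roughly 1/d for MF, 1/sqrt(d) for NTK, with the q_sigma = 3/4 scaling in
# between) is exactly what the paper varies; 1/sqrt(fan-in) initialization is used
# here only as a numerically stable placeholder.
d, in_dim = 128, X.shape[1]
w = (torch.randn(in_dim, d) / in_dim ** 0.5).requires_grad_()  # input-to-hidden weights
a = (torch.randn(d) / d ** 0.5).requires_grad_()               # hidden-to-output weights

def forward(x):
    return torch.relu(x @ w) @ a

# Full-batch GD with the unscaled reference learning rates eta_a = eta_w = 0.02.
# The number of steps is an assumption; the captions do not state it.
eta_a = eta_w = 0.02
for step in range(100):
    loss = F.soft_margin_loss(forward(X), y)  # logistic loss on +/-1 labels (assumed)
    loss.backward()
    with torch.no_grad():
        w -= eta_w * w.grad
        a -= eta_a * a.grad
        w.grad.zero_()
        a.grad.zero_()
```

The RMSProp runs quoted from the Figure 2 caption would replace the manual update above with an RMSProp optimizer at η_a = η_w = 0.0002; that variant is not sketched here.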