On Measuring Excess Capacity in Neural Networks
Authors: Florian Graf, Sebastian Zeng, Bastian Rieck, Marc Niethammer, Roland Kwitt
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on benchmark datasets of varying task difficulty indicate that (1) there is a substantial amount of excess capacity per task, and (2) capacity can be kept at a surprisingly similar level across tasks. |
| Researcher Affiliation | Academia | Florian Graf, University of Salzburg, florian.graf@plus.ac.at; Sebastian Zeng, University of Salzburg, sebastian.zeng@plus.ac.at; Bastian Rieck, Institute for AI and Health, Helmholtz Munich, bastian@rieck.me; Marc Niethammer, UNC Chapel Hill, mn@cs.unc.edu; Roland Kwitt, University of Salzburg, roland.kwitt@plus.ac.at |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Source code is available at https://github.com/rkwitt/excess_capacity. |
| Open Datasets | Yes | We test on three benchmark datasets: CIFAR10/100 [25], and Tiny-ImageNet-200 [24], listed in order of increasing task difficulty. |
| Dataset Splits | No | We adhere to the common training/testing splits of the three datasets we used, i.e., CIFAR10/100 and Tiny-ImageNet-200. (The paper does not mention a validation split or give specific percentages/sizes for the splits.) |
| Hardware Specification | Yes | Section B.4 lists all hardware resources used in our experiments. |
| Software Dependencies | No | The paper mentions optimizers (SGD) and frameworks (PyTorch, TensorFlow, JAX), either in general terms or in relation to third-party tools, but does not provide specific version numbers for the software dependencies used in its own experimental setup. |
| Experiment Setup | Yes | We minimize the cross-entropy loss using SGD with momentum (0.9) and small weight decay (1e-4) for 200 epochs with batch size 256 and follow a CIFAR-typical stepwise learning rate schedule, decaying the initial learning rate (of 3e-3) by a factor of 5 at epochs 60, 120, and 160. No data augmentation is used. When projecting onto the constraint sets, we found one alternating projection step every 15th SGD update to be sufficient to remain close to C. (See the sketch after this table.) |
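The training configuration quoted in the Experiment Setup row can be summarized in a minimal PyTorch sketch. This is an illustrative reconstruction, not the authors' released code (see their repository linked above): the network choice (`resnet18`) and the `project_onto_constraints` helper are hypothetical placeholders, since the paper's constraint sets C and its alternating projection are not reproduced here.

```python
# Minimal sketch of the reported setup: SGD with momentum 0.9, weight decay 1e-4,
# batch size 256, 200 epochs, LR 3e-3 decayed by a factor of 5 at epochs 60/120/160,
# no data augmentation, and one projection step every 15th SGD update.
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as transforms


def project_onto_constraints(model):
    """Hypothetical stand-in for one alternating projection step onto C."""
    pass


transform = transforms.ToTensor()  # no data augmentation
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=transform
)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=256, shuffle=True, num_workers=4
)

model = torchvision.models.resnet18(num_classes=10)  # assumed architecture
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    model.parameters(), lr=3e-3, momentum=0.9, weight_decay=1e-4
)
# Decaying by a factor of 5 corresponds to gamma = 0.2 at the listed milestones.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[60, 120, 160], gamma=0.2
)

step = 0
for epoch in range(200):
    for inputs, targets in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        step += 1
        if step % 15 == 0:  # one alternating projection every 15th SGD update
            project_onto_constraints(model)
    scheduler.step()
```

For the actual projection onto the constraint sets and the exact architectures used per dataset, refer to the authors' implementation at https://github.com/rkwitt/excess_capacity.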