On Measuring Excess Capacity in Neural Networks

Authors: Florian Graf, Sebastian Zeng, Bastian Rieck, Marc Niethammer, Roland Kwitt

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on benchmark datasets of varying task difficulty indicate that (1) there is a substantial amount of excess capacity per task, and (2) capacity can be kept at a surprisingly similar level across tasks.
Researcher Affiliation | Academia | Florian Graf, University of Salzburg (florian.graf@plus.ac.at); Sebastian Zeng, University of Salzburg (sebastian.zeng@plus.ac.at); Bastian Rieck, Institute for AI and Health, Helmholtz Munich (bastian@rieck.me); Marc Niethammer, UNC Chapel Hill (mn@cs.unc.edu); Roland Kwitt, University of Salzburg (roland.kwitt@plus.ac.at)
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Source code is available at https://github.com/rkwitt/excess_capacity.
Open Datasets | Yes | We test on three benchmark datasets: CIFAR10/100 [25] and Tiny-ImageNet-200 [24], listed in order of increasing task difficulty.
Dataset Splits | No | We adhere to the common training/testing splits of the three datasets we used, i.e., CIFAR10/100 and Tiny-ImageNet-200. (The paper does not mention a validation split or give specific split sizes or percentages.)
Hardware Specification | Yes | Section B.4 lists all hardware resources used in our experiments.
Software Dependencies | No | The paper mentions optimizers (SGD) and frameworks (PyTorch, TensorFlow, JAX) in general terms or in connection with third-party tools, but does not specify version numbers for the software used in its own experimental setup.
Experiment Setup | Yes | We minimize the cross-entropy loss using SGD with momentum (0.9) and small weight decay (1e-4) for 200 epochs with batch size 256, and follow a CIFAR-typical stepwise learning rate schedule, decaying the initial learning rate (of 3e-3) by a factor of 5 at epochs 60, 120, and 160. No data augmentation is used. When projecting onto the constraint sets, we found one alternating projection step every 15th SGD update to be sufficient to remain close to C.
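
For orientation, the quoted training setup could be sketched in PyTorch roughly as follows. This is a minimal sketch under stated assumptions: the CIFAR10 loader, the linear placeholder model, and the project_onto_constraints stub are illustrative stand-ins, not the authors' method; the actual constrained architectures and alternating projections onto C are implemented in the linked repository.

```python
import torch
from torch import nn
from torch.optim import SGD
from torch.optim.lr_scheduler import MultiStepLR
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# CIFAR10 with its standard train split; the paper uses no data augmentation,
# so only tensor conversion is applied. CIFAR100 / Tiny-ImageNet-200 analogous.
train_set = datasets.CIFAR10(root="data", train=True, download=True,
                             transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)

# Placeholder model; the paper trains constrained ResNet-style networks,
# which are provided in the authors' repository and not reproduced here.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

criterion = nn.CrossEntropyLoss()
optimizer = SGD(model.parameters(), lr=3e-3, momentum=0.9, weight_decay=1e-4)
# Decay the initial learning rate (3e-3) by a factor of 5 at epochs 60, 120, 160.
scheduler = MultiStepLR(optimizer, milestones=[60, 120, 160], gamma=1 / 5)


def project_onto_constraints(module: nn.Module) -> None:
    """Hypothetical stand-in for one alternating projection step onto the
    constraint set C; the actual projections live in the authors' code."""


step = 0
for epoch in range(200):
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        step += 1
        if step % 15 == 0:  # one alternating projection every 15th SGD update
            project_onto_constraints(model)
    scheduler.step()
```

The MultiStepLR schedule with gamma = 1/5 and milestones [60, 120, 160] mirrors the quoted decay-by-a-factor-of-5 schedule; everything specific to the excess-capacity constraints must come from the repository above.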