Optimization and Bayes: A Trade-off for Overparameterized Neural Networks

Authors: Zhengmian Hu, Heng Huang

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first illustrate that Hessian trace doesn't vanish for overparameterized network and our analysis induces an efficient estimation of this value. Next, we verify our theoretical finding by comparing the dynamics of an overparameterized network in function space and parameter space. Finally, we demonstrate the interpolation of sampling and optimization. (A generic Hessian-trace estimation sketch follows the table.)
Researcher Affiliation | Academia | Zhengmian Hu, Heng Huang, Department of Computer Science, University of Maryland, College Park, MD 20740
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the methodology described.
Open Datasets | Yes | We consider one-shot learning on Fashion-MNIST [75].
Dataset Splits | No | The paper does not specify general training, validation, and test dataset splits for reproducibility. It only mentions a specific 'one-shot learning' setup where 'one sample for each class as training dataset' is selected, without defining the overall dataset partitioning.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory, or specific computing platforms) used for conducting the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for any software libraries, frameworks, or dependencies used in the experiments.
Experiment Setup | Yes | In Section 8.3, for experiments on Fashion-MNIST, the paper states: 'We use a single hidden layer network with width being 1024 and softplus activation. We use loss l(y, t) = 1/(1 + exp(yt)) and surrogate loss ls(y, t) = log(1 + exp(-yt)) for gradient descent. For Gibbs measure, we fix λ = 180. The entropy change is approximately evaluated by integrating Eq. (9) with finite step size and fixed Θ(d). We train 10^5 independent networks.' In Section 8.2, it mentions: 'For dynamics in parameter space, we run SGD with finite step size 0.01 and mini-batch size 1.' (A hedged code sketch of this setup follows the table.)
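
The experiment-setup quote above fixes the architecture, losses, and optimizer hyperparameters but not the data pipeline or output encoding. The following PyTorch sketch is a minimal reconstruction from the quoted text only; the input dimension of 784, the scalar output with labels y in {-1, +1}, and the preprocessing are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch of the quoted setup (Sections 8.2-8.3): single hidden layer
# of width 1024 with softplus activation, logistic surrogate loss, and SGD
# with step size 0.01 and mini-batch size 1. Input dimension and the scalar
# output with y in {-1, +1} are assumptions, not taken from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

WIDTH = 1024        # hidden width quoted in the paper
STEP_SIZE = 0.01    # SGD step size quoted in Section 8.2
D_IN = 28 * 28      # assumed: flattened Fashion-MNIST images

net = nn.Sequential(
    nn.Linear(D_IN, WIDTH),
    nn.Softplus(),
    nn.Linear(WIDTH, 1),
)

def loss_eval(y, t):
    # Evaluation loss l(y, t) = 1 / (1 + exp(y t)) from the quote.
    return 1.0 / (1.0 + torch.exp(y * t))

def loss_surrogate(y, t):
    # Surrogate loss l_s(y, t) = log(1 + exp(-y t)); softplus(-y t) is the
    # numerically stable form of the same expression.
    return F.softplus(-y * t)

opt = torch.optim.SGD(net.parameters(), lr=STEP_SIZE)

def sgd_step(x, y):
    # One SGD update with mini-batch size 1, as quoted in Section 8.2.
    opt.zero_grad()
    t = net(x.view(1, -1)).squeeze()
    loss_surrogate(y, t).backward()
    opt.step()
    return loss_eval(y, t.detach()).item()
```

The Gibbs-measure experiment (λ = 180) and the 10^5 independent networks are not reproduced here, since the quote does not specify how the Gibbs measure is sampled.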
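
The Research Type row quotes the paper's claim that the Hessian trace does not vanish for overparameterized networks and that the analysis yields an efficient estimator of this quantity. The paper's own estimator is not reproduced in the quote; the sketch below is a standard Hutchinson-style stochastic trace estimator, shown only as one common way to compute such a value, not as the paper's method.

```python
# Generic Hutchinson-style estimator of tr(H), where H is the Hessian of a
# scalar loss with respect to the model parameters. Standard technique shown
# for context only; it is not the estimator derived in the paper.
import torch

def hessian_trace(loss, params, num_samples=100):
    # tr(H) = E[v^T H v] for Rademacher vectors v; Hessian-vector products
    # are obtained by differentiating the gradient a second time.
    params = [p for p in params if p.requires_grad]
    grads = torch.autograd.grad(loss, params, create_graph=True)
    total = 0.0
    for _ in range(num_samples):
        vs = [torch.bernoulli(torch.full_like(p, 0.5)) * 2.0 - 1.0 for p in params]
        grad_dot_v = sum((g * v).sum() for g, v in zip(grads, vs))
        hvps = torch.autograd.grad(grad_dot_v, params, retain_graph=True)
        total += sum((hvp * v).sum().item() for hvp, v in zip(hvps, vs))
    return total / num_samples
```

A typical call would be `hessian_trace(loss, net.parameters())` after computing the training loss on a batch; more samples reduce the variance of the estimate.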