Modern Neural Networks Generalize on Small Data Sets

Authors: Matthew Olson, Abraham Wyner, Richard Berk

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this paper, we use a linear program to empirically decompose fitted neural networks into ensembles of low-bias sub-networks. We show that these sub-networks are relatively uncorrelated which leads to an internal regularization process, very much like a random forest, which can explain why a neural network is surprisingly resistant to overfitting. We then demonstrate this in practice by applying large neural networks, with hundreds of parameters per training observation, to a collection of 116 real-world data sets from the UCI Machine Learning Repository."
Researcher Affiliation | Academia | Matthew Olson, Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, maolson@wharton.upenn.edu; Abraham J. Wyner, Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, ajw@wharton.upenn.edu; Richard Berk, Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, berkr@wharton.upenn.edu
Pseudocode | No | The paper describes procedures using mathematical equations and textual explanations, but it does not contain a structured pseudocode block or an explicitly labeled algorithm.
Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets | Yes | "In this work, we consider a much richer class of small data sets from the UCI Machine Learning Repository in order to study the generalization paradox." and "The collection of data sets we consider were first analyzed in a large-scale study comparing the accuracy of 147 different classifiers [10]."
Dataset Splits | No | The paper states that "All results are reported over 25 randomly chosen 80-20 training-testing splits" and refers to "cross-validated accuracy," but it does not explicitly describe a separate validation split (e.g., a percentage or sample count of validation data) beyond what cross-validation might imply. (See the split-protocol sketch below the table.)
Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions components like the Adam optimizer and ELU activation function, but it does not list specific software dependencies (e.g., programming languages, libraries, frameworks) along with their version numbers required for reproducibility.
Experiment Setup | Yes | "Both networks shared the following architecture and training specifications: 10 hidden layers, 100 nodes per layer, 200 epochs of gradient descent using Adam optimizer with a learning rate of 0.001 [15]. He-initialization for each hidden layer [12], Elu activation function [8]." and "More specifically, one network was fit using dropout with a keep-rate of 0.85, while the other network was fit without explicit regularization." (See the configuration sketch below the table.)
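
For orientation, the following is a minimal sketch of the quoted experiment setup: 10 hidden layers of 100 units, ELU activations, He initialization, optional dropout with a keep-rate of 0.85, and 200 epochs of Adam at learning rate 0.001. PyTorch is an assumed framework here (the paper does not name its software stack), so this is an illustration of the quoted configuration, not the authors' implementation.

```python
# Sketch of the quoted setup: 10 hidden layers x 100 ELU units, He init,
# optional dropout with keep-rate 0.85 (drop probability 0.15),
# 200 epochs of Adam at learning rate 0.001.
# PyTorch is an assumption; the paper does not name a framework.
import torch
import torch.nn as nn

def make_network(n_features, n_classes, use_dropout=False):
    layers = []
    in_dim = n_features
    for _ in range(10):                       # 10 hidden layers
        linear = nn.Linear(in_dim, 100)       # 100 nodes per layer
        # He initialization (kaiming_normal_ only knows ReLU-family gains;
        # used here as the usual stand-in even though the activation is ELU)
        nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")
        layers.append(linear)
        layers.append(nn.ELU())               # ELU activation
        if use_dropout:
            layers.append(nn.Dropout(p=0.15))  # keep-rate 0.85
        in_dim = 100
    layers.append(nn.Linear(in_dim, n_classes))
    return nn.Sequential(*layers)

def train(model, X, y, epochs=200, lr=1e-3):
    # 200 epochs of full-batch gradient descent with Adam, lr = 0.001
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), y)
        loss.backward()
        opt.step()
    return model
```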
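
Similarly, the evaluation protocol quoted under Dataset Splits (accuracy averaged over 25 randomly chosen 80-20 training-testing splits, with no separate validation split reported) could be sketched as below. scikit-learn's train_test_split and the fit_and_score callable are illustrative assumptions, not tools named in the paper.

```python
# Sketch of the quoted protocol: test accuracy averaged over 25 randomly
# chosen 80-20 training-testing splits. scikit-learn and the fit_and_score
# helper are illustrative assumptions; the paper does not name its tooling.
import numpy as np
from sklearn.model_selection import train_test_split

def mean_test_accuracy(fit_and_score, X, y, n_splits=25, test_size=0.2):
    # fit_and_score(X_train, y_train, X_test, y_test) -> accuracy on the test split
    scores = []
    for seed in range(n_splits):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        scores.append(fit_and_score(X_train, y_train, X_test, y_test))
    return float(np.mean(scores))
```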