Modern Neural Networks Generalize on Small Data Sets
Authors: Matthew Olson, Abraham Wyner, Richard Berk
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "In this paper, we use a linear program to empirically decompose fitted neural networks into ensembles of low-bias sub-networks. We show that these sub-networks are relatively uncorrelated which leads to an internal regularization process, very much like a random forest, which can explain why a neural network is surprisingly resistant to overfitting. We then demonstrate this in practice by applying large neural networks, with hundreds of parameters per training observation, to a collection of 116 real-world data sets from the UCI Machine Learning Repository." |
| Researcher Affiliation | Academia | Matthew Olson, Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, maolson@wharton.upenn.edu; Abraham J. Wyner, Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, ajw@wharton.upenn.edu; Richard Berk, Department of Statistics, Wharton School, University of Pennsylvania, Philadelphia, PA 19104, berkr@wharton.upenn.edu |
| Pseudocode | No | The paper describes procedures using mathematical equations and textual explanations, but it does not contain a structured pseudocode block or an explicitly labeled algorithm. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the methodology described. |
| Open Datasets | Yes | "In this work, we consider a much richer class of small data sets from the UCI Machine Learning Repository in order to study the generalization paradox." and "The collection of data sets we consider were first analyzed in a large-scale study comparing the accuracy of 179 different classifiers [10]." |
| Dataset Splits | No | The paper states "All results are reported over 25 randomly chosen 80-20 training-testing splits" and mentions "cross-validated accuracy," but it does not explicitly describe a separate validation split (e.g., a percentage or sample count for validation data) beyond what cross-validation implies (the split protocol is sketched after the table). |
| Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., GPU/CPU models, memory) used to run the experiments. |
| Software Dependencies | No | The paper mentions components like the Adam optimizer and ELU activation function, but it does not list specific software dependencies (e.g., programming languages, libraries, frameworks) along with their version numbers required for reproducibility. |
| Experiment Setup | Yes | "Both networks shared the following architecture and training specifications: 10 hidden layers, 100 nodes per layer, 200 epochs of gradient descent using Adam optimizer with a learning rate of 0.001 [15]. He-initialization for each hidden layer [12], Elu activation function [8]." and "More specifically, one network was fit using dropout with a keep-rate of 0.85, while the other network was fit without explicit regularization." (see the training sketch after the table) |
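
The two sketches below illustrate, under stated assumptions, the protocol these rows describe; neither is the authors' released code (the paper provides none). First, the evaluation loop from the Dataset Splits row: accuracy averaged over 25 randomly chosen 80-20 training-testing splits. The synthetic data and the placeholder classifier are stand-ins for a UCI data set and for the paper's network, which is sketched next.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Stand-in for one of the 116 UCI data sets; any (X, y) pair works here.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

accuracies = []
for seed in range(25):  # 25 random splits, as reported in the paper
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)  # 80-20 split
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # placeholder model
    accuracies.append(clf.score(X_te, y_te))

print(f"mean test accuracy over 25 splits: {np.mean(accuracies):.3f}")
```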
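Second, a sketch of the architecture and training specification quoted in the Experiment Setup row: 10 hidden layers of 100 nodes, ELU activations, He initialization, 200 epochs of Adam at learning rate 0.001, and an optional dropout keep-rate of 0.85 (i.e., a drop probability of 0.15). The paper does not name a framework, so PyTorch and full-batch training are assumptions here.

```python
import torch
import torch.nn as nn

def make_network(n_features: int, n_classes: int, dropout: bool = False) -> nn.Sequential:
    """10 hidden layers x 100 ELU nodes; optional dropout with keep-rate 0.85."""
    layers, in_dim = [], n_features
    for _ in range(10):                          # 10 hidden layers, 100 nodes each
        linear = nn.Linear(in_dim, 100)
        # He (Kaiming) initialization; the 'relu' gain is a common proxy for ELU
        nn.init.kaiming_normal_(linear.weight, nonlinearity="relu")
        nn.init.zeros_(linear.bias)
        layers += [linear, nn.ELU()]
        if dropout:
            layers.append(nn.Dropout(p=0.15))    # keep-rate 0.85 -> drop prob 0.15
        in_dim = 100
    layers.append(nn.Linear(in_dim, n_classes))  # linear output layer
    return nn.Sequential(*layers)

def train(model, X, y, epochs=200, lr=1e-3):
    """200 epochs of (full-batch) gradient descent with Adam, lr = 0.001."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return model

# Tiny smoke test on random stand-in data.
X_t, y_t = torch.randn(200, 20), torch.randint(0, 2, (200,))
model = train(make_network(n_features=20, n_classes=2), X_t, y_t)
```

Note that with ten 100-node layers this network has on the order of 10^5 weights, consistent with the paper's "hundreds of parameters per training observation" on data sets of a few hundred rows.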