Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Inconsistency, Instability, and Generalization Gap of Deep Neural Network Training
Authors: Rie Johnson, Tong Zhang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical study based on this analysis shows that instability and inconsistency are strongly predictive of generalization gap in various settings. Our empirical study consists of three parts. |
| Researcher Affiliation | Collaboration | Rie Johnson RJ Research Consulting New York, USA EMAIL Tong Zhang HKUST Hong Kong EMAIL This work was done when the second author was jointly with Google Research. |
| Pseudocode | Yes | Algorithm 1: Training with consistency encouragement. Algorithm 2: Our semi-supervised variant of co-distillation. |
| Open Source Code | No | The paper mentions using publicly available models and tools (e.g., EfficientNet-B0, SAM from github.com/google-research/sam), but it does not state that its own methodology implementation or novel code is open-sourced or provide a link to its own code repository. |
| Open Datasets | Yes | Table 1: Datasets. Name: ImageNet [7], Food101 [3], Dogs [21], Cars [23], CIFAR-10 [24], MNLI [36], QNLI [36]. |
| Dataset Splits | Yes | The expectation values involved in CP and SP were estimated by taking the average; in particular, the expectation over data distribution Z was estimated on the held-out unlabeled data disjoint from training data or test data. (K, J) was set to (4,8) for CIFAR-10/100 and (4,4) for ImageNet, and the size of each training set was set to 4K for CIFAR-10/100 and 120K (10%) for ImageNet. ... on the development data (held-out 5K data points). |
| Hardware Specification | Yes | All the experiments were done using GPUs (A100 or older). ... no TPU |
| Software Dependencies | No | The paper mentions software components like SGD, AdamW, and references models like RoBERTa-base, but it does not provide specific version numbers for any libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used. |
| Experiment Setup | Yes | Table 10: Basic settings shared by all the models for each case (Case#1–7,10; images). Fields: training type (from scratch, fine-tuning, distillation), dataset, network, batch size, epochs, update steps, warmup steps, learning rate, schedule, optimizer, weight decay, label smoothing, iterate averaging, gradient clipping, data augmentation. Table 12: Hyperparameters for SAM (m-sharpness, ρ), listed by case: 1, 2, 3, 4, 5, 6, 7, 8,9, 10. |