Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Latent State Models of Training Dynamics

Authors: Michael Y. Hu, Angelica Chen, Naomi Saphra, Kyunghyun Cho

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	To understand the effect of randomness on the dynamics and outcomes of neural network training, we train models multiple times with different random seeds and compute a variety of metrics throughout training, such as the L2 norm, mean, and variance of the neural network's weights. We then fit a hidden Markov model (HMM; Baum & Petrie, 1966) over the resulting sequences of metrics. ... we train HMMs on training trajectories derived from grokking tasks, language modeling, and image classification across a variety of model architectures and sizes.
Researcher Affiliation	Collaboration	Michael Y. Hu (New York University); Angelica Chen (New York University); Naomi Saphra (New York University); Kyunghyun Cho (New York University; Prescient Design, Genentech; CIFAR LMB)
Pseudocode	No	The paper describes the methodology and algorithms in prose and mathematical formulas, but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures.
Open Source Code	Yes	Our code is available at https://github.com/michahu/modeling-training.
Open Datasets	Yes	We collect 40 runs of ResNet18 (He et al., 2016) trained on CIFAR-100 (Krizhevsky, 2009)... The dynamics of MNIST are similar to that of CIFAR-100. We collect 40 training runs of a two-layer MLP learning image classification on MNIST, with hyperparameters based on Simard et al. (2003).
Dataset Splits	Yes	We collect trajectories using 40 random seeds and train and validate the HMM on a random 80-20 validation split, a split that we use for all settings. ... Training data size 50000 (splits downloaded from PyTorch) ... Training data size 60000 (splits downloaded from PyTorch)
Hardware Specification	No	The paper does not provide specific hardware details such as GPU/CPU models, processors, or cloud instance types used for running the experiments. It only mentions general training processes.
Software Dependencies	No	The paper mentions software components like "PyTorch" and optimizers like "AdamW" and "SGD," but it does not specify any version numbers for these or other software dependencies.
Experiment Setup	Yes	For all hyperparameter details, see Appendix D. ... Appendix D Training Hyperparameters (hyperparameter: value): Learning Rate: 1e-1; Batch Size: 32; Training data size (randomly generated): 1000; Architecture: Multilayer perceptron; Number of hidden layers: 1; Model Hidden Size: 128; Weight Decay: 0.01; Seed: 0 through 40; Optimizer: SGD
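
The Research Type and Dataset Splits rows quote two concrete steps: computing per-checkpoint weight statistics (L2 norm, mean, variance) and splitting the 40 seeded runs 80-20 into training and validation sets for the HMM. The following is a minimal sketch of those two steps only, using the Python standard library. It is not taken from the authors' repository: the function names `weight_metrics` and `split_runs`, the `split_seed` parameter, the toy weights, the use of population variance, and the numbering of runs as 0 to 39 are all assumptions made for illustration. Fitting the HMM itself is not shown.

```python
import math
import random
import statistics


def weight_metrics(weights):
    """Return the L2 norm, mean, and population variance of a flat list of weights.

    Hypothetical helper: the paper computes such metrics during training,
    but this exact function is an illustration, not the authors' code.
    """
    return {
        "l2_norm": math.sqrt(sum(w * w for w in weights)),
        "mean": statistics.fmean(weights),
        "variance": statistics.pvariance(weights),
    }


def split_runs(run_ids, val_fraction=0.2, split_seed=0):
    """Shuffle run identifiers and return (train_runs, val_runs).

    val_fraction=0.2 mirrors the quoted 80-20 split; split_seed is an
    assumed parameter that makes the shuffle repeatable.
    """
    rng = random.Random(split_seed)
    shuffled = list(run_ids)
    rng.shuffle(shuffled)
    n_val = round(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]


# Toy weights standing in for one checkpoint of one training run (assumed values).
metrics = weight_metrics([3.0, 4.0])

# 40 runs, identified here by assumed seed numbers 0..39.
train_runs, val_runs = split_runs(range(40))
```

In this sketch each run would contribute one sequence of such metric dictionaries, one per checkpoint, and the HMM would be fit on the sequences from `train_runs` and evaluated on those from `val_runs`.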