Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Predicting Out-of-Domain Generalization with Neighborhood Invariance

Authors: Nathan Hoyen Ng, Neha Hulkund, Kyunghyun Cho, Marzyeh Ghassemi

TMLR 2023 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In experiments on robustness benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our neighborhood invariance measure and actual OOD generalization on over 4,600 models evaluated on over 100 unique train/test domain pairs."
Researcher Affiliation | Collaboration | Nathan Ng (University of Toronto; Vector Institute; Massachusetts Institute of Technology), Neha Hulkund (Massachusetts Institute of Technology), Kyunghyun Cho (New York University; Prescient Design, Genentech; CIFAR Fellow), Marzyeh Ghassemi (Massachusetts Institute of Technology; CIFAR AI Chair; Vector Institute)
Pseudocode | No | The paper provides mathematical definitions (Eqs. 1-4) for its measure but does not include a distinct block explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | The paper mentions using the 'fairseq framework' and the 'pre-trained RoBERTa BASE model provided by fairseq', but does not state that the authors release their own implementation of the methodology described in the paper, nor does it link to such a repository.
Open Datasets | Yes | "We select common OOD benchmark datasets in image classification (Krizhevsky, 2009; Lu et al., 2020; Recht et al., 2018; Deng, 2012; Darlow et al., 2018; Netzer et al., 2011; Arjovsky et al., 2019; Taori et al., 2020), sentiment analysis (Ni et al., 2019), and natural language inference (Williams et al., 2018)"
Dataset Splits | Yes | "We split the dataset into 10 different domains based on review category. For all domains and datasets, models are trained to predict a review's star rating from 1 to 5. Natural Language Inference (NLI) We use the MNLI (Williams et al., 2018) dataset, a corpus of NLI data from 10 distinct genres of written and spoken English. We train on the 5 genres with training data and evaluate on all 10 genres."
Hardware Specification | Yes | "All models are trained on a single RTX6000 GPU." "Our RoBERTa models ... trained on a single RTX6000 or T4 GPU."
Software Dependencies | No | The paper mentions the 'Adam optimizer (Kingma & Ba, 2014)' and the 'fairseq framework (Ott et al., 2019)', but does not provide version numbers for any software libraries or dependencies beyond the cited papers themselves.
Experiment Setup | Yes | "The total number of models trained and converged in each pool, as well as details on hyperparameter variations for each task and model, are provided in Table 1. We include further details on model training, the hyperparameter space, and specific choices in hyperparameters in Appendix A.4, A.2, and A.3. All models are trained with the Adam optimizer (Kingma & Ba, 2014) with β = (0.9, 0.98) and ε = 1 × 10−6. CNN models are trained with learning rate 1 × 10−3 and RoBERTa models are trained with learning rate 1 × 10−5. We use an inverse square root learning rate scheduler to anneal learning rate over training. We early stop CNN models on sentiment analysis at 0.04 cross entropy and on NLI at 0.03 cross entropy. We early stop RoBERTa models on sentiment analysis at 0.05 cross entropy and on NLI at 0.03 cross entropy."
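The quoted experiment setup (inverse square root learning-rate annealing plus loss-threshold early stopping) can be sketched as a minimal, framework-free Python snippet. This is an illustration, not the authors' code: the function names are hypothetical, and the 4,000-step warmup length is an assumption, since the quote does not state one; the base learning rates and early-stopping thresholds are taken from the quoted text.

```python
# Hypothetical sketch of the quoted training schedule. Not the authors'
# implementation; warmup_steps = 4000 is an assumed value (not stated in
# the paper's quoted text).

def inv_sqrt_lr(step, base_lr=1e-3, warmup_steps=4000):
    """Inverse square root schedule: linear warmup to base_lr, then
    decay proportional to 1/sqrt(step). base_lr=1e-3 matches the quoted
    CNN setting (RoBERTa would use 1e-5)."""
    step = max(step, 1)
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (warmup_steps ** 0.5) / (step ** 0.5)

def should_stop(train_cross_entropy, threshold=0.04):
    """Early stop once training cross-entropy falls below the threshold,
    e.g. 0.04 for CNNs on sentiment analysis per the quoted text."""
    return train_cross_entropy <= threshold
```

Adam's quoted hyperparameters would correspond to `betas=(0.9, 0.98)` and `eps=1e-6` in a typical optimizer configuration.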