Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Predicting Out-of-Domain Generalization with Neighborhood Invariance
Authors: Nathan Hoyen Ng, Neha Hulkund, Kyunghyun Cho, Marzyeh Ghassemi
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In experiments on robustness benchmarks in image classification, sentiment analysis, and natural language inference, we demonstrate a strong and robust correlation between our neighborhood invariance measure and actual OOD generalization on over 4,600 models evaluated on over 100 unique train/test domain pairs. |
| Researcher Affiliation | Collaboration | Nathan Ng (University of Toronto; Vector Institute; Massachusetts Institute of Technology); Neha Hulkund (Massachusetts Institute of Technology); Kyunghyun Cho (New York University; Prescient Design, Genentech; CIFAR Fellow); Marzyeh Ghassemi (Massachusetts Institute of Technology; CIFAR AI Chair; Vector Institute) |
| Pseudocode | No | The paper provides mathematical definitions (Eq. 1-4) for its measure but does not include a distinct block explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | The paper mentions using the 'fairseq framework' and 'pre-trained RoBERTa BASE model provided by fairseq', but does not explicitly state that the authors are releasing their own implementation code for the methodology described in this paper, nor does it provide a link to such a repository. |
| Open Datasets | Yes | We select common OOD benchmark datasets in image classification (Krizhevsky, 2009; Lu et al., 2020; Recht et al., 2018; Deng, 2012; Darlow et al., 2018; Netzer et al., 2011; Arjovsky et al., 2019; Taori et al., 2020), sentiment analysis (Ni et al., 2019), and natural language inference (Williams et al., 2018) |
| Dataset Splits | Yes | We split the dataset into 10 different domains based on review category. For all domains and datasets, models are trained to predict a review's star rating from 1 to 5. Natural Language Inference (NLI) We use the MNLI (Williams et al., 2018) dataset, a corpus of NLI data from 10 distinct genres of written and spoken English. We train on the 5 genres with training data and evaluate on all 10 genres. |
| Hardware Specification | Yes | All models are trained on a single RTX6000 GPU. Our RoBERTa models ... trained on a single RTX6000 or T4 GPU. |
| Software Dependencies | No | The paper mentions 'Adam optimizer (Kingma & Ba, 2014)' and 'fairseq framework (Ott et al., 2019)', but does not provide specific version numbers for any software libraries or dependencies used for implementation, beyond the cited papers themselves. |
| Experiment Setup | Yes | The total number of models trained and converged in each pool, as well as details on hyperparameter variations for each task and model, are provided in Table 1. We include further details on model training, the hyperparameter space, and specific choices in hyperparameters in Appendix A.4, A.2, and A.3. All models are trained with the Adam optimizer (Kingma & Ba, 2014) with β = (0.9, 0.98) and ε = 1 × 10⁻⁶. CNN models are trained with learning rate 1 × 10⁻³ and RoBERTa models are trained with learning rate 1 × 10⁻⁵. We use an inverse square root learning rate scheduler to anneal the learning rate over training. We early stop CNN models on sentiment analysis at 0.04 cross entropy and on NLI at 0.03 cross entropy. We early stop RoBERTa models on sentiment analysis at 0.05 cross entropy and on NLI at 0.03 cross entropy. |
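The inverse square root schedule quoted in the experiment setup can be sketched in plain Python. This follows the common fairseq-style formulation (linear warmup, then 1/√step decay); the `warmup_steps` value is an assumption for illustration, as it is not stated in the quoted setup.

```python
import math

def inverse_sqrt_lr(step, base_lr=1e-3, warmup_steps=4000):
    """Inverse square root LR schedule (fairseq-style sketch).

    Linearly warms up to base_lr over warmup_steps, then decays
    proportionally to 1/sqrt(step). base_lr=1e-3 matches the CNN
    setting quoted above; warmup_steps=4000 is an assumed value.
    """
    if step < warmup_steps:
        # Linear warmup phase: LR grows from 0 to base_lr.
        return base_lr * step / warmup_steps
    # Decay phase: LR shrinks as 1/sqrt(step), continuous at the boundary.
    return base_lr * math.sqrt(warmup_steps / step)
```

For example, at 4× the warmup steps the learning rate has halved: `inverse_sqrt_lr(16000)` returns `5e-4`.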