Great Models Think Alike: Improving Model Reliability via Inter-Model Latent Agreement
Authors: Ailin Deng, Miao Xiong, Bryan Hooi
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Theoretical analysis and extensive experiments on failure detection across various datasets verify the effectiveness of our method on both in-distribution and out-of-distribution settings. We conduct extensive experiments on failure detection to verify the benefits of our framework to improve model reliability and provide theoretical justification for our method. |
| Researcher Affiliation | Academia | 1School of Computing, National University of Singapore, Singapore 2Institute of Data Science, National University of Singapore, Singapore. Correspondence to: Ailin Deng <ailin@u.nus.edu>. |
| Pseudocode | Yes | We summarize our framework in Algorithm 1 in the Appendix: Algorithm 1, Inter-model Latent Agreement. (An illustrative sketch of the idea follows this table.) |
| Open Source Code | Yes | Our code is available via https://github.com/d-ailin/latent-agreement |
| Open Datasets | Yes | We run experiments on six in-distribution datasets and five distribution shifts to evaluate the failure detection performance. For in-distribution, we use CIFAR10 (Krizhevsky et al.), CIFAR100, STL (Coates et al., 2011), BIRDS (Wah et al., 2011), FOOD (Bossard et al., 2014) and a large-scale dataset, ImageNet (ImageNet-1K) (Deng et al., 2009). |
| Dataset Splits | Yes | Table 3 (number of images per dataset and associated splits). CIFAR10: 10 classes, 50000 train / 1000 val. / 9000 test; CIFAR100: 100 classes, 50000 train / 1000 val. / 9000 test; BIRDS: 200 classes, 5994 train / 2897 val. / 2897 test; STL: 10 classes, 5000 train / 4000 val. / 4000 test, 100000 unlabeled; FOOD: 102 classes, 75750 train / 12625 val. / 12625 test; ImageNet: 1000 classes, 1281167 train / 10000 val. / 40000 test, no unlabeled set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using PyTorch Image Models but does not provide specific version numbers for PyTorch, Python, or other software dependencies required for reproducibility. |
| Experiment Setup | Yes | Section 4.1, Experimental Setup; Section A.2, Training Recipe: For ResNet-50 models, we fine-tune with the Adam optimizer with learning rate 1e-4 and (β1, β2) = (0.9, 0.99). For ViT, we fine-tuned with a cosine annealing scheduler; details are shown in Table 4 (training parameters per dataset for ViT; init-lr: initial learning rate of the cosine annealing scheduler as selected; steps: number of batches trained on). Section A.4, Hyperparameters: We have training set size n and neighborhood size k as hyperparameters. For main results, except for the ablation study, we use n = 10000 across all datasets... We select k ∈ {10, 20, 50, 100, 200, 500, 1000} with optimal AUROC performance on the validation split for each dataset. (Hedged sketches of this setup follow the table.) |
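The report only names Algorithm 1 (Inter-model Latent Agreement) without reproducing it. As a rough illustration of the idea, here is a minimal sketch assuming the score measures how much a test sample's k-nearest-neighbor set, computed over a reference set in each model's latent space, overlaps between the main model and an auxiliary model. All function and variable names are illustrative, and this hard-overlap variant may differ from the paper's actual formulation; see the linked repository for the authors' implementation.

```python
# Hedged sketch: inter-model latent agreement as k-NN set overlap.
import numpy as np

def knn_indices(query: np.ndarray, reference: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k nearest reference embeddings by cosine similarity."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    r = reference / np.linalg.norm(reference, axis=1, keepdims=True)
    sims = q @ r.T                        # (num_queries, n) similarities
    return np.argsort(-sims, axis=1)[:, :k]

def latent_agreement(z_main, z_aux, ref_main, ref_aux, k=50):
    """Fraction of shared k-NN reference indices across the two latent spaces."""
    nn_main = knn_indices(z_main, ref_main, k)
    nn_aux = knn_indices(z_aux, ref_aux, k)
    scores = [len(set(a) & set(b)) / k for a, b in zip(nn_main, nn_aux)]
    return np.array(scores)               # higher = more inter-model agreement

# Toy usage: 5 test points, a reference set of 1000 points, 64/128-dim latents.
rng = np.random.default_rng(0)
ref_main, ref_aux = rng.normal(size=(1000, 64)), rng.normal(size=(1000, 128))
z_main, z_aux = rng.normal(size=(5, 64)), rng.normal(size=(5, 128))
print(latent_agreement(z_main, z_aux, ref_main, ref_aux, k=50))
```

Note that the latent dimensions of the two models need not match, since agreement is measured over neighbor identities rather than raw coordinates.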
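For the training setup quoted in the Experiment Setup row, a minimal configuration sketch follows, assuming timm (PyTorch Image Models, which the paper mentions) for model construction. The quote does not name the ViT optimizer, and the per-dataset init-lr and step counts live in the paper's Table 4, so those values below are placeholders.

```python
import timm
import torch

# ResNet-50 fine-tuning as quoted from Appendix A.2: Adam, lr 1e-4,
# betas (0.9, 0.99). num_classes=10 is a placeholder (e.g., CIFAR10).
resnet = timm.create_model("resnet50", pretrained=False, num_classes=10)
resnet_opt = torch.optim.Adam(resnet.parameters(), lr=1e-4, betas=(0.9, 0.99))

# ViT fine-tuning uses a cosine annealing scheduler; optimizer choice, the
# initial lr, and T_max (schedule length in steps) are assumptions here,
# since the real per-dataset values are in the paper's Table 4.
vit = timm.create_model("vit_base_patch16_224", pretrained=False, num_classes=10)
vit_opt = torch.optim.Adam(vit.parameters(), lr=1e-4)
vit_sched = torch.optim.lr_scheduler.CosineAnnealingLR(vit_opt, T_max=10_000)

# In a training loop, call vit_opt.step() each batch, then vit_sched.step().
```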
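Finally, the paper selects the neighborhood size k from {10, 20, 50, 100, 200, 500, 1000} by validation AUROC. A small sketch of that sweep, building on the illustrative `latent_agreement` function above; `val_errors` (a boolean array marking the main model's validation misclassifications) and the scoring direction are assumptions, not the authors' code.

```python
from sklearn.metrics import roc_auc_score

def select_k(z_main, z_aux, ref_main, ref_aux, val_errors,
             grid=(10, 20, 50, 100, 200, 500, 1000)):
    """Pick the k whose agreement score best detects failures (max AUROC)."""
    best_k, best_auroc = None, -1.0
    for k in grid:
        scores = latent_agreement(z_main, z_aux, ref_main, ref_aux, k=k)
        # Assumption: agreement is high on correct predictions, low on failures,
        # so we score correctness (~val_errors) with the agreement values.
        auroc = roc_auc_score(~val_errors, scores)
        if auroc > best_auroc:
            best_k, best_auroc = k, auroc
    return best_k, best_auroc
```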