Predicting the Performance of Foundation Models via Agreement-on-the-Line

Authors: Rahul Saxena, Taeyoun Kim, Aman Mehra, Christina Baek, J. Zico Kolter, Aditi Raghunathan

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our work, we demonstrate that when lightly finetuning multiple runs from a single foundation model, the choice of randomness during training (linear head initialization, data ordering, and data subsetting) can lead to drastically different levels of agreement-on-the-line in the resulting ensemble. (See the ensemble-diversity sketch after the table.)
Researcher Affiliation | Collaboration | Carnegie Mellon University; Bosch Center for AI
Pseudocode | No | The paper describes the ALine algorithms with mathematical equations but does not present them in structured pseudocode or a clearly labeled algorithm block. (A hedged sketch of the ALine-S estimator appears after the table.)
Open Source Code | Yes | Our work has no algorithmic contributions and results can be replicated with https://github.com/kebaek/Agreement-on-the-line
Open Datasets | Yes | Datasets: We evaluate ensembles on synthetic corruptions (CIFAR10C, CIFAR100C, ImageNet C), dataset replication shifts (CIFAR10.1, ImageNetV2), style shifts (OfficeHome), geographical and temporal shifts (FMoW-WILDS, iWildCam-WILDS), and interlaboratory shifts in medicine (Camelyon17-WILDS). Table 1 lists the benchmarks as ID → OOD pairs: CIFAR10 [29] → CIFAR10C [22], CIFAR10.1 [45]; CIFAR100 [29] → CIFAR100C [22]; ImageNet [48] → ImageNet C [22], CIFAR10.1 [45]...
Dataset Splits | Yes | Data subsetting: We i.i.d. sample a p% subset of the data to train over. In the main body, we report models trained on independently sampled 10% subsets of the training data; other proportions of 30% and 50% are reported in Appendix A.4. Given access to a labeled validation set from D_ID...
Hardware Specification | Yes | We use at most four A6000s for all experiments, except for linear probing, where we use one RTX 8000.
Software Dependencies | No | The paper mentions software components such as foundation models (e.g., GPT2, OPT, BERT, CLIP) and optimizers (SGD, AdamW) but does not provide version numbers for these or for ancillary libraries such as PyTorch or TensorFlow.
Experiment Setup | Yes | We state here the hyperparameters used to finetune the models for the diversity experiments reported in Section 3. Table 6 (CLIP linear probing) lists, for CIFAR10: learning rate 5e-4, batch size 1028. (A training sketch using these values follows the table.)
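
Since the paper presents the ALine algorithms only as equations (per the Pseudocode row), here is a minimal sketch of an ALine-S-style estimator, not the authors' implementation: fit a line between probit-transformed ID and OOD agreement rates (both computable without OOD labels), then map each model's probit-transformed ID accuracy through that line. The function names and the (n_models, n_examples) array layout are our assumptions.

```python
# Minimal ALine-S-style sketch (hypothetical helper names, not the authors' code).
# preds_id, preds_ood: (n_models, n_examples) arrays of predicted labels.
# labels_id: ID ground-truth labels. OOD labels are never used.
import itertools
import numpy as np
from scipy.stats import norm


def probit(p, eps=1e-6):
    """Inverse Gaussian CDF, clipped away from {0, 1} for numerical safety."""
    return norm.ppf(np.clip(p, eps, 1 - eps))


def aline_s(preds_id, preds_ood, labels_id):
    n = len(preds_id)
    pairs = list(itertools.combinations(range(n), 2))
    # Agreement is label-free: the fraction of examples where two models match.
    agr_id = np.array([(preds_id[i] == preds_id[j]).mean() for i, j in pairs])
    agr_ood = np.array([(preds_ood[i] == preds_ood[j]).mean() for i, j in pairs])
    # Fit the agreement line in probit-transformed coordinates.
    slope, bias = np.polyfit(probit(agr_id), probit(agr_ood), deg=1)
    # Map each model's labeled ID accuracy through the same line.
    acc_id = np.array([(p == labels_id).mean() for p in preds_id])
    return norm.cdf(slope * probit(acc_id) + bias)  # estimated OOD accuracies
```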
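The Research Type and Dataset Splits rows describe three sources of training randomness: linear head initialization, data ordering, and i.i.d. data subsetting. A minimal sketch of how one finetuning run's randomness might be configured, with hypothetical names, under the 10% subsetting reported in the main body:

```python
# Sketch of per-run randomness (hypothetical names; the paper varies head
# initialization, data ordering, and i.i.d. data subsetting across runs).
import numpy as np
import torch


def make_run(dataset_size, feat_dim, n_classes, seed, subset_frac=0.10):
    rng = np.random.default_rng(seed)
    # (1) Data subsetting: i.i.d. sample a p% subset of the training data.
    subset = rng.choice(dataset_size, size=int(subset_frac * dataset_size),
                        replace=False)
    # (2) Data ordering: shuffle the subset with the run's own seed.
    order = rng.permutation(subset)
    # (3) Head initialization: re-initialize the linear probe per run.
    torch.manual_seed(seed)
    head = torch.nn.Linear(feat_dim, n_classes)
    return order, head
```

Varying only the seed across runs then produces ensembles whose diversity depends on which of these three sources is actually allowed to change.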
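Finally, the Experiment Setup row quotes the Table 6 hyperparameters for CLIP linear probing on CIFAR10. A minimal, self-contained training sketch using those values; the frozen-feature setup, the random stand-in data, and the AdamW choice are assumptions (the paper mentions both SGD and AdamW without tying one to this table):

```python
# Linear-probe sketch with the quoted Table 6 values: lr 5e-4, batch size 1028.
import torch
from torch.utils.data import DataLoader, TensorDataset

feat_dim, n_classes = 512, 10                    # e.g., CLIP ViT-B/32 feature width
feats = torch.randn(5000, feat_dim)              # stand-in for frozen CLIP features
labels = torch.randint(0, n_classes, (5000,))
loader = DataLoader(TensorDataset(feats, labels), batch_size=1028, shuffle=True)

head = torch.nn.Linear(feat_dim, n_classes)      # the linear probe being trained
opt = torch.optim.AdamW(head.parameters(), lr=5e-4)

for x, y in loader:
    loss = torch.nn.functional.cross_entropy(head(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```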