Predicting the Performance of Foundation Models via Agreement-on-the-Line

Authors: Rahul Saxena, Taeyoun Kim, Aman Mehra, Christina Baek, J. Zico Kolter, Aditi Raghunathan

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our work, we demonstrate that when lightly finetuning multiple runs from a single foundation model, the choice of randomness during training (linear head initialization, data ordering, and data subsetting) can lead to drastically different levels of agreement-on-the-line in the resulting ensemble. (See the ensemble-diversity sketch after the table.)
Researcher Affiliation | Collaboration | Carnegie Mellon University; Bosch Center for AI
Pseudocode | No | The paper describes the ALine algorithms with mathematical equations but does not present them in structured pseudocode or a clearly labeled algorithm block. (A hedged sketch of the ALine-S estimator appears after the table.)
Open Source Code | Yes | Our work has no algorithmic contributions and results can be replicated with https://github.com/kebaek/Agreement-on-the-line
Open Datasets | Yes | Datasets: We evaluate ensembles on synthetic corruptions (CIFAR10C, CIFAR100C, ImageNet C), dataset replication shifts (CIFAR10.1, ImageNetV2), style shifts (OfficeHome), geographical and temporal shifts (FMoW-WILDS, iWildCam-WILDS), and interlaboratory shifts in medicine (Camelyon17-WILDS). Table 1 lists the benchmarks as ID → OOD pairs: CIFAR10 [29] → CIFAR10C [22], CIFAR10.1 [45]; CIFAR100 [29] → CIFAR100C [22]; ImageNet [48] → ImageNet C [22], CIFAR10.1 [45]...
Dataset Splits | Yes | Data subsetting: We i.i.d. sample a p% subset of the data to train over. In the main body, we report models trained on independently sampled 10% subsets of the training data; other proportions of 30% and 50% are reported in Appendix A.4. Given access to a labeled validation set from D_ID...
Hardware Specification | Yes | We use at most four A6000s for all experiments, except for linear probing, where we use one RTX 8000.
Software Dependencies | No | The paper mentions software components such as foundation models (e.g., GPT2, OPT, BERT, CLIP) and optimizers (SGD, AdamW) but does not provide version numbers for these or for ancillary libraries such as PyTorch or TensorFlow.
Experiment Setup | Yes | We state here the hyperparameters used to finetune the models for the diversity experiments reported in Section 3. Table 6 (CLIP linear probing) lists, for CIFAR10: learning rate 5e-4, batch size 1028. (A training sketch using these values follows the table.)
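
Since the paper presents the ALine algorithms only as equations (per the Pseudocode row), here is a minimal sketch of an ALine-S-style estimator, not the authors' implementation: fit a line between probit-transformed ID and OOD agreement rates (both computable without OOD labels), then map each model's probit-transformed ID accuracy through that line. The function names and the (n_models, n_examples) array layout are our assumptions.

```python
# Minimal ALine-S-style sketch (hypothetical helper names, not the authors' code).
# preds_id, preds_ood: (n_models, n_examples) arrays of predicted labels.
# labels_id: ID ground-truth labels. OOD labels are never used.
import itertools
import numpy as np
from scipy.stats import norm


def probit(p, eps=1e-6):
    """Inverse Gaussian CDF, clipped away from {0, 1} for numerical safety."""
    return norm.ppf(np.clip(p, eps, 1 - eps))


def aline_s(preds_id, preds_ood, labels_id):
    n = len(preds_id)
    pairs = list(itertools.combinations(range(n), 2))
    # Agreement is label-free: the fraction of examples where two models match.
    agr_id = np.array([(preds_id[i] == preds_id[j]).mean() for i, j in pairs])
    agr_ood = np.array([(preds_ood[i] == preds_ood[j]).mean() for i, j in pairs])
    # Fit the agreement line in probit-transformed coordinates.
    slope, bias = np.polyfit(probit(agr_id), probit(agr_ood), deg=1)
    # Map each model's labeled ID accuracy through the same line.
    acc_id = np.array([(p == labels_id).mean() for p in preds_id])
    return norm.cdf(slope * probit(acc_id) + bias)  # estimated OOD accuracies
```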
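The Research Type and Dataset Splits rows describe three sources of training randomness: linear head initialization, data ordering, and i.i.d. data subsetting. A minimal sketch of how one finetuning run's randomness might be configured, with hypothetical names, under the 10% subsetting reported in the main body:

```python
# Sketch of per-run randomness (hypothetical names; the paper varies head
# initialization, data ordering, and i.i.d. data subsetting across runs).
import numpy as np
import torch


def make_run(dataset_size, feat_dim, n_classes, seed, subset_frac=0.10):
    rng = np.random.default_rng(seed)
    # (1) Data subsetting: i.i.d. sample a p% subset of the training data.
    subset = rng.choice(dataset_size, size=int(subset_frac * dataset_size),
                        replace=False)
    # (2) Data ordering: shuffle the subset with the run's own seed.
    order = rng.permutation(subset)
    # (3) Head initialization: re-initialize the linear probe per run.
    torch.manual_seed(seed)
    head = torch.nn.Linear(feat_dim, n_classes)
    return order, head
```

Varying only the seed across runs then produces ensembles whose diversity depends on which of these three sources is actually allowed to change.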
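Finally, the Experiment Setup row quotes the Table 6 hyperparameters for CLIP linear probing on CIFAR10. A minimal, self-contained training sketch using those values; the frozen-feature setup, the random stand-in data, and the AdamW choice are assumptions (the paper mentions both SGD and AdamW without tying one to this table):

```python
# Linear-probe sketch with the quoted Table 6 values: lr 5e-4, batch size 1028.
import torch
from torch.utils.data import DataLoader, TensorDataset

feat_dim, n_classes = 512, 10                    # e.g., CLIP ViT-B/32 feature width
feats = torch.randn(5000, feat_dim)              # stand-in for frozen CLIP features
labels = torch.randint(0, n_classes, (5000,))
loader = DataLoader(TensorDataset(feats, labels), batch_size=1028, shuffle=True)

head = torch.nn.Linear(feat_dim, n_classes)      # the linear probe being trained
opt = torch.optim.AdamW(head.parameters(), lr=5e-4)

for x, y in loader:
    loss = torch.nn.functional.cross_entropy(head(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```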