Predicting the Performance of Foundation Models via Agreement-on-the-Line
Authors: Rahul Saxena, Taeyoun Kim, Aman Mehra, Christina Baek, J. Zico Kolter, Aditi Raghunathan
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our work, we demonstrate that when lightly finetuning multiple runs from a single foundation model, the choice of randomness during training (linear head initialization, data ordering, and data subsetting) can lead to drastically different levels of agreement-on-the-line in the resulting ensemble. (A minimal sketch of building such an ensemble appears after this table.) |
| Researcher Affiliation | Collaboration | Carnegie Mellon University, Bosch Center for AI |
| Pseudocode | No | The paper describes the ALine algorithms using mathematical equations but does not present them in structured pseudocode or a clearly labeled algorithm block. (A rough sketch of the ALine-S idea appears after this table.) |
| Open Source Code | Yes | Our work has no algorithmic contributions and results can be replicated with https://github.com/kebaek/Agreement-on-the-line |
| Open Datasets | Yes | Datasets: We evaluate ensembles on synthetic corruptions (CIFAR10-C, CIFAR100-C, ImageNet-C), dataset replication shifts (CIFAR10.1, ImageNetV2), style shifts (OfficeHome), geographical and temporal shifts (FMoW-WILDS, iWildCam-WILDS), and interlaboratory shifts in medicine (Camelyon17-WILDS). Table 1: We evaluate models on the following distribution shift benchmarks (ID → OOD): CIFAR10 [29] → CIFAR10-C [22], CIFAR10.1 [45]; CIFAR100 [29] → CIFAR100-C [22]; ImageNet [48] → ImageNet-C [22], CIFAR10.1 [45]... |
| Dataset Splits | Yes | Data subsetting: We i.i.d. sample a p% subset of the data to train over. In the main body, we report models trained on independently sampled 10% of the training data; other proportions of 30% and 50% are reported in Appendix A.4. Given access to a labeled validation set from D_ID... |
| Hardware Specification | Yes | We use at most four A6000s for all experiments except for linear probing, where we use one RTX 8000. |
| Software Dependencies | No | The paper mentions software components such as various foundation models (e.g., GPT2, OPT, BERT, CLIP) and optimizers (SGD, AdamW), but does not provide version numbers for them or for ancillary software libraries such as PyTorch or TensorFlow. |
| Experiment Setup | Yes | We state here the hyperparameters used to finetune the models for the diversity experiments reported in Section 3. Table 6 (CLIP linear probing hyperparameters), CIFAR10: learning rate 5e-4, batch size 1028. |
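
The "Research Type" and "Dataset Splits" rows above describe how the ensembles are constructed: many lightly finetuned heads on a single frozen foundation model, with diversity coming only from the random seed that controls linear head initialization, data ordering, and i.i.d. data subsetting. The following is a minimal sketch of that setup, not the authors' released code; the frozen CLIP features, feature dimension, optimizer, and epoch count are illustrative assumptions, while the 10% subset, 5e-4 learning rate, and 1028 batch size are taken from the quotes above.

```python
# Minimal sketch (not the authors' code): build a diverse ensemble of linear
# probes on frozen foundation-model features by varying the three sources of
# randomness named in the paper: head initialization, data ordering, and
# i.i.d. data subsetting.
import torch
from torch import nn
from torch.utils.data import DataLoader, Subset, TensorDataset

def train_linear_probe(features, labels, num_classes, seed, subset_frac=0.1,
                       lr=5e-4, batch_size=1028, epochs=10):
    """Train one linear head; `seed` controls init, ordering, and subsetting."""
    g = torch.Generator().manual_seed(seed)

    # (1) i.i.d. subsetting: sample a subset_frac fraction of the training data.
    n = features.shape[0]
    idx = torch.randperm(n, generator=g)[: int(subset_frac * n)]
    dataset = Subset(TensorDataset(features, labels), idx.tolist())

    # (2) data ordering: the DataLoader shuffles with the same seeded generator.
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True, generator=g)

    # (3) head initialization: seed torch before constructing the linear layer.
    torch.manual_seed(seed)
    head = nn.Linear(features.shape[1], num_classes)

    opt = torch.optim.SGD(head.parameters(), lr=lr, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(head(x), y).backward()
            opt.step()
    return head

# Usage: precomputed features stand in for frozen CLIP embeddings (random here).
feats = torch.randn(5000, 512)
labels = torch.randint(0, 10, (5000,))
ensemble = [train_linear_probe(feats, labels, num_classes=10, seed=s) for s in range(5)]
```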
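Similarly, the "Pseudocode" row notes that the ALine estimators are given as equations rather than algorithm blocks. The sketch below illustrates the general ALine-S idea from the agreement-on-the-line literature, assuming the standard formulation (fit a linear trend between probit-transformed ID and OOD agreement over model pairs, then apply the same slope and bias to each model's ID accuracy); it is not the paper's implementation, and the function name and toy inputs are hypothetical.

```python
# Minimal ALine-S-style sketch (assumed formulation, not the paper's code):
# estimate OOD accuracy from ID accuracy via the slope/bias of the linear fit
# between probit-transformed ID and OOD agreement over all model pairs.
import numpy as np
from scipy.stats import norm  # norm.ppf is the probit (inverse Gaussian CDF)

def aline_s(id_agreement, ood_agreement, id_accuracy):
    """Agreements are rates in (0, 1) per model pair; accuracies per model.
    Returns the predicted OOD accuracy for each model."""
    # Fit probit(OOD agreement) ~ a * probit(ID agreement) + b over model pairs.
    x = norm.ppf(np.asarray(id_agreement))
    y = norm.ppf(np.asarray(ood_agreement))
    a, b = np.polyfit(x, y, deg=1)
    # Agreement-on-the-line: accuracy follows the same slope and bias, so apply
    # the fitted trend to each model's ID accuracy.
    return norm.cdf(a * norm.ppf(np.asarray(id_accuracy)) + b)

# Toy usage with made-up numbers:
pred = aline_s(id_agreement=[0.91, 0.89, 0.93],
               ood_agreement=[0.72, 0.69, 0.75],
               id_accuracy=[0.94, 0.92, 0.95])
print(pred)
```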