Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution
Authors: Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, Percy Liang
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On 10 distribution shift datasets (BREEDS-Living17, BREEDS-Entity30, DomainNet, CIFAR→STL, CIFAR-10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We run experiments on ten benchmark datasets with deep neural networks and see that given good pretrained features, fine-tuning (FT) does better ID but worse OOD than linear probing (LP). |
| Researcher Affiliation | Academia | Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, Percy Liang (Stanford University, Computer Science Department) |
| Pseudocode | No | The paper describes methods through mathematical equations and textual descriptions, but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Updated code is available at https://github.com/AnanyaKumar/transfer_learning and this CodaLab worksheet. |
| Open Datasets | Yes | We use standard distribution shift datasets: DomainNet (Peng et al., 2019; Tan et al., 2020), BREEDS-Living-17 (Santurkar et al., 2020), BREEDS-Entity-30 (Santurkar et al., 2020), CIFAR-10→STL (Krizhevsky, 2009; Coates et al., 2011; French et al., 2018), CIFAR-10→CIFAR-10.1 (Recht et al., 2018), ImageNet-1K (Russakovsky et al., 2015) where the OOD test sets are ImageNetV2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2020), ImageNet-A (Hendrycks et al., 2019b), and ImageNet-Sketch (Wang et al., 2019), and FMoW Geo-shift, which is adapted from the satellite remote sensing dataset Functional Map of the World (Christie et al., 2018; Koh et al., 2021). |
| Dataset Splits | Yes | Hyperparameters are selected on held-out ID validation splits: for fine-tuning, we early stop and choose the best learning rate using ID validation accuracy; for linear probing, we select the best ℓ2-regularization hyperparameter based on ID validation accuracy. |
| Hardware Specification | No | The paper mentions deep learning models used (e.g., ResNet-50, ViT-B/16) but does not specify any particular hardware components like GPU or CPU models, memory sizes, or types of accelerators used for experiments. |
| Software Dependencies | No | The paper mentions torchvision but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments. |
| Experiment Setup | Yes | For fine-tuning on each dataset we swept over 6 learning rates, using a cosine learning rate schedule and batch size of 64. We early stop and choose the best learning rate using ID validation accuracy. For linear probing we train an ℓ2-regularized logistic regression classifier on frozen features from the penultimate layer of the pretrained model, selecting the best ℓ2-regularization hyperparameter based on ID validation accuracy. (Minimal sketches of both protocols appear after this table.) |
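
To make the linear-probing protocol concrete, here is a minimal sketch: an ℓ2-regularized logistic regression classifier trained on frozen penultimate-layer features, with the regularization strength chosen by ID validation accuracy. The ResNet-50 backbone, the regularization grid, and the `train_loader`/`id_val_loader` names are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

def extract_features(backbone, loader, device="cpu"):
    """Run the frozen backbone and collect penultimate-layer features."""
    backbone.eval().to(device)
    feats, labels = [], []
    with torch.no_grad():
        for x, y in loader:
            feats.append(backbone(x.to(device)).cpu().numpy())
            labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# Replace the classification head with Identity so forward() returns
# penultimate-layer (2048-d) features.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()

X_tr, y_tr = extract_features(backbone, train_loader)   # train_loader: assumed
X_va, y_va = extract_features(backbone, id_val_loader)  # id_val_loader: assumed

# Sweep the l2 penalty (C is the inverse regularization strength) and keep
# the probe with the best ID validation accuracy. The grid is hypothetical.
best_probe, best_acc = None, -1.0
for C in [1e-3, 1e-2, 1e-1, 1.0, 10.0]:
    probe = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
    acc = probe.score(X_va, y_va)
    if acc > best_acc:
        best_probe, best_acc = probe, acc
```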
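A corresponding fine-tuning sketch of the stated sweep: batch size 64, cosine learning-rate schedule, six learning rates, early stopping on ID validation accuracy. The epoch count, SGD momentum, the learning-rate grid, and the `train_dataset`/`id_val_loader`/`num_classes` names are assumptions for illustration; keeping the checkpoint with the best ID validation accuracy is one common reading of "early stop", not necessarily the paper's exact implementation.

```python
import copy
import torch
import torchvision.models as models
from torch.utils.data import DataLoader

def evaluate(model, loader, device="cpu"):
    """Top-1 accuracy on a loader (used here for ID validation)."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            pred = model(x.to(device)).argmax(dim=1).cpu()
            correct += (pred == y).sum().item()
            total += y.numel()
    return correct / total

def fine_tune(lr, train_dataset, id_val_loader, num_classes,
              epochs=20, device="cpu"):
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    model.to(device)
    loader = DataLoader(train_dataset, batch_size=64, shuffle=True)  # batch size 64
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs)  # cosine schedule
    loss_fn = torch.nn.CrossEntropyLoss()
    best_state, best_acc = None, -1.0
    for _ in range(epochs):
        model.train()
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x.to(device)), y.to(device)).backward()
            opt.step()
        sched.step()
        acc = evaluate(model, id_val_loader, device)
        if acc > best_acc:  # early stopping: keep the best ID-val checkpoint
            best_state, best_acc = copy.deepcopy(model.state_dict()), acc
    return best_state, best_acc

# Sweep 6 learning rates; keep the run with the best ID validation accuracy.
lrs = [3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2]            # hypothetical grid
runs = {lr: fine_tune(lr, train_dataset, id_val_loader, num_classes)
        for lr in lrs}                                # train_dataset etc.: assumed
best_lr = max(runs, key=lambda lr: runs[lr][1])
```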