Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

Authors: Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, Percy Liang

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On 10 distribution shift datasets (BREEDS-Living-17, BREEDS-Entity-30, DomainNet, CIFAR-10→STL, CIFAR-10→CIFAR-10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We run experiments on ten benchmark datasets with deep neural networks and see that given good pretrained features, fine-tuning (FT) does better ID but worse OOD than linear probing (LP).
Researcher Affiliation | Academia | Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, Percy Liang; Stanford University, Computer Science Department.
Pseudocode | No | The paper describes its methods through mathematical equations and textual descriptions, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Updated code is available at https://github.com/AnanyaKumar/transfer_learning and in an accompanying CodaLab worksheet.
Open Datasets | Yes | We use standard distribution shift datasets: DomainNet (Peng et al., 2019; Tan et al., 2020), BREEDS-Living-17 (Santurkar et al., 2020), BREEDS-Entity-30 (Santurkar et al., 2020), CIFAR-10→STL (Krizhevsky, 2009; Coates et al., 2011; French et al., 2018), CIFAR-10→CIFAR-10.1 (Recht et al., 2018), ImageNet-1K (Russakovsky et al., 2015), where the OOD test sets are ImageNetV2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2020), ImageNet-A (Hendrycks et al., 2019b), and ImageNet-Sketch (Wang et al., 2019), and FMoW Geo-shift, which is adapted from the satellite remote sensing dataset Functional Map of the World (Christie et al., 2018; Koh et al., 2021).
Dataset Splits | Yes | For fine-tuning on each dataset we swept over 6 learning rates, using a cosine learning rate schedule and batch size of 64. We early stop and choose the best learning rate using ID validation accuracy. For linear probing we train an ℓ2-regularized logistic regression classifier on frozen features from the penultimate layer of the pretrained model, selecting the best ℓ2-regularization hyperparameter based on ID validation accuracy.
Hardware Specification | No | The paper mentions the deep learning models used (e.g., ResNet-50, ViT-B/16) but does not specify any particular hardware components such as GPU or CPU models, memory sizes, or types of accelerators used for the experiments.
Software Dependencies | No | The paper mentions torchvision but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup | Yes | For fine-tuning on each dataset we swept over 6 learning rates, using a cosine learning rate schedule and batch size of 64. We early stop and choose the best learning rate using ID validation accuracy. For linear probing we train an ℓ2-regularized logistic regression classifier on frozen features from the penultimate layer of the pretrained model, selecting the best ℓ2-regularization hyperparameter based on ID validation accuracy. (Both recipes are sketched in code below the table.)
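
To make the linear-probing recipe quoted in the Dataset Splits and Experiment Setup rows concrete, here is a minimal sketch. It is not the authors' released code: the torchvision ResNet-50 backbone, the scikit-learn classifier, and the regularization grid are illustrative assumptions; only the overall recipe (frozen penultimate-layer features, ℓ2-regularized logistic regression, hyperparameter selection by ID validation accuracy) comes from the paper.

```python
# Sketch of linear probing (LP): l2-regularized logistic regression on
# frozen penultimate-layer features, selected by ID validation accuracy.
# Backbone, classifier library, and C grid are assumptions, not the paper's code.
import numpy as np
import torch
from torchvision import models
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"

# Penultimate-layer features: replace ResNet-50's final fc layer with identity.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval().to(device)

@torch.no_grad()
def extract_features(loader):
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x.to(device)).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def linear_probe(train_loader, id_val_loader):
    """Fit the probe for each regularization strength on frozen features
    and keep the one with the best ID validation accuracy."""
    X_tr, y_tr = extract_features(train_loader)
    X_val, y_val = extract_features(id_val_loader)
    best_acc, best_clf = -1.0, None
    for C in (1e-3, 1e-2, 1e-1, 1.0, 10.0):  # C is the inverse l2 strength (assumed grid)
        clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
        acc = clf.score(X_val, y_val)
        if acc > best_acc:
            best_acc, best_clf = acc, clf
    return best_clf, best_acc
```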
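
The fine-tuning sweep can be sketched the same way: six learning rates, a cosine schedule, batch size 64, and early stopping on ID validation accuracy. The SGD optimizer, the particular learning-rate grid, and the epoch budget are assumptions for illustration; the paper specifies only the sweep size, schedule, batch size, and selection criterion.

```python
# Sketch of the fine-tuning (FT) sweep: 6 learning rates, cosine schedule,
# batch size 64, early stopping on ID validation accuracy.
import copy
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def evaluate(model, loader):
    """Top-1 accuracy of `model` on `loader`."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

def finetune_sweep(make_model, train_set, id_val_set, epochs=20,
                   lrs=(3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2)):  # assumed grid of 6 LRs
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
    val_loader = torch.utils.data.DataLoader(id_val_set, batch_size=64)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_acc, best_state = -1.0, None
    for lr in lrs:
        model = make_model()  # fresh copy of the pretrained model per run
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        sched = CosineAnnealingLR(opt, T_max=epochs)
        for _ in range(epochs):
            model.train()
            for x, y in train_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
            sched.step()
            # Early stopping + LR selection: keep the checkpoint with the
            # best ID validation accuracy across all epochs and learning rates.
            acc = evaluate(model, val_loader)
            if acc > best_acc:
                best_acc, best_state = acc, copy.deepcopy(model.state_dict())
    return best_state, best_acc
```

OOD accuracy is then measured by running evaluate on the shifted test set with the selected checkpoint; consistent with the paper's protocol, the OOD set plays no role in model selection.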