Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

Authors: Ananya Kumar, Aditi Raghunathan, Robbie Matthew Jones, Tengyu Ma, Percy Liang

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On 10 distribution shift datasets (BREEDS-Living-17, BREEDS-Entity-30, DomainNet, CIFAR-10→STL, CIFAR-10→CIFAR-10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We run experiments on ten benchmark datasets with deep neural networks and see that given good pretrained features, fine-tuning (FT) does better ID but worse OOD than linear probing (LP).
Researcher Affiliation | Academia | Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, Percy Liang; Stanford University, Computer Science Department.
Pseudocode | No | The paper describes its methods through mathematical equations and textual descriptions, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Updated code is available at https://github.com/AnanyaKumar/transfer_learning and in an accompanying CodaLab worksheet.
Open Datasets | Yes | We use standard distribution shift datasets: DomainNet (Peng et al., 2019; Tan et al., 2020), BREEDS-Living-17 (Santurkar et al., 2020), BREEDS-Entity-30 (Santurkar et al., 2020), CIFAR-10→STL (Krizhevsky, 2009; Coates et al., 2011; French et al., 2018), CIFAR-10→CIFAR-10.1 (Recht et al., 2018), ImageNet-1K (Russakovsky et al., 2015), where the OOD test sets are ImageNetV2 (Recht et al., 2019), ImageNet-R (Hendrycks et al., 2020), ImageNet-A (Hendrycks et al., 2019b), and ImageNet-Sketch (Wang et al., 2019), and FMoW Geo-shift, which is adapted from the satellite remote sensing dataset Functional Map of the World (Christie et al., 2018; Koh et al., 2021).
Dataset Splits | Yes | For fine-tuning on each dataset we swept over 6 learning rates, using a cosine learning rate schedule and batch size of 64. We early stop and choose the best learning rate using ID validation accuracy. For linear probing we train an ℓ2-regularized logistic regression classifier on frozen features from the penultimate layer of the pretrained model, selecting the best ℓ2-regularization hyperparameter based on ID validation accuracy.
Hardware Specification | No | The paper mentions the deep learning models used (e.g., ResNet-50, ViT-B/16) but does not specify any particular hardware components such as GPU or CPU models, memory sizes, or types of accelerators used for the experiments.
Software Dependencies | No | The paper mentions torchvision but does not provide specific version numbers for any software libraries, frameworks, or programming languages used in the experiments.
Experiment Setup | Yes | For fine-tuning on each dataset we swept over 6 learning rates, using a cosine learning rate schedule and batch size of 64. We early stop and choose the best learning rate using ID validation accuracy. For linear probing we train an ℓ2-regularized logistic regression classifier on frozen features from the penultimate layer of the pretrained model, selecting the best ℓ2-regularization hyperparameter based on ID validation accuracy. (Both recipes are sketched in code below the table.)
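
To make the linear-probing recipe quoted in the Dataset Splits and Experiment Setup rows concrete, here is a minimal sketch. It is not the authors' released code: the torchvision ResNet-50 backbone, the scikit-learn classifier, and the regularization grid are illustrative assumptions; only the overall recipe (frozen penultimate-layer features, ℓ2-regularized logistic regression, hyperparameter selection by ID validation accuracy) comes from the paper.

```python
# Sketch of linear probing (LP): l2-regularized logistic regression on
# frozen penultimate-layer features, selected by ID validation accuracy.
# Backbone, classifier library, and C grid are assumptions, not the paper's code.
import numpy as np
import torch
from torchvision import models
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"

# Penultimate-layer features: replace ResNet-50's final fc layer with identity.
backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval().to(device)

@torch.no_grad()
def extract_features(loader):
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x.to(device)).cpu().numpy())
        labels.append(y.numpy())
    return np.concatenate(feats), np.concatenate(labels)

def linear_probe(train_loader, id_val_loader):
    """Fit the probe for each regularization strength on frozen features
    and keep the one with the best ID validation accuracy."""
    X_tr, y_tr = extract_features(train_loader)
    X_val, y_val = extract_features(id_val_loader)
    best_acc, best_clf = -1.0, None
    for C in (1e-3, 1e-2, 1e-1, 1.0, 10.0):  # C is the inverse l2 strength (assumed grid)
        clf = LogisticRegression(C=C, max_iter=1000).fit(X_tr, y_tr)
        acc = clf.score(X_val, y_val)
        if acc > best_acc:
            best_acc, best_clf = acc, clf
    return best_clf, best_acc
```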
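
The fine-tuning sweep can be sketched the same way: six learning rates, a cosine schedule, batch size 64, and early stopping on ID validation accuracy. The SGD optimizer, the particular learning-rate grid, and the epoch budget are assumptions for illustration; the paper specifies only the sweep size, schedule, batch size, and selection criterion.

```python
# Sketch of the fine-tuning (FT) sweep: 6 learning rates, cosine schedule,
# batch size 64, early stopping on ID validation accuracy.
import copy
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

def evaluate(model, loader):
    """Top-1 accuracy of `model` on `loader`."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

def finetune_sweep(make_model, train_set, id_val_set, epochs=20,
                   lrs=(3e-5, 1e-4, 3e-4, 1e-3, 3e-3, 1e-2)):  # assumed grid of 6 LRs
    train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)
    val_loader = torch.utils.data.DataLoader(id_val_set, batch_size=64)
    loss_fn = torch.nn.CrossEntropyLoss()
    best_acc, best_state = -1.0, None
    for lr in lrs:
        model = make_model()  # fresh copy of the pretrained model per run
        opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
        sched = CosineAnnealingLR(opt, T_max=epochs)
        for _ in range(epochs):
            model.train()
            for x, y in train_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
            sched.step()
            # Early stopping + LR selection: keep the checkpoint with the
            # best ID validation accuracy across all epochs and learning rates.
            acc = evaluate(model, val_loader)
            if acc > best_acc:
                best_acc, best_state = acc, copy.deepcopy(model.state_dict())
    return best_state, best_acc
```

OOD accuracy is then measured by running evaluate on the shifted test set with the selected checkpoint; consistent with the paper's protocol, the OOD set plays no role in model selection.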