Surgical Fine-Tuning Improves Adaptation to Distribution Shifts

Authors: Yoonho Lee, Annie S Chen, Fahim Tajwar, Ananya Kumar, Huaxiu Yao, Percy Liang, Chelsea Finn

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our main contribution is the empirical observation that fine-tuning only a small contiguous subset of layers can outperform full fine-tuning on a range of distribution shifts. Intriguingly, the best layers to tune differ for different distribution shift types (Figure 1)." This finding is validated empirically across seven real-world datasets and three types of distribution shifts, and theoretically in an idealized two-layer neural network setup. (A hedged sketch of this layer-freezing scheme appears after the table.)
Researcher Affiliation | Academia | Yoonho Lee (Stanford University), Annie S. Chen (Stanford University), Fahim Tajwar (Stanford University), Ananya Kumar (Stanford University), Huaxiu Yao (Stanford University), Percy Liang (Stanford University), Chelsea Finn (Stanford University)
Pseudocode | No | The paper describes methods and theoretical analyses but does not include explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement of, or link to, its own open-source code for the described methodology.
Open Datasets | Yes | "For example, the source dataset can be the 50,000 training images in CIFAR-10 (Krizhevsky et al., 2009) while the target dataset is a smaller set of 1000 corrupted CIFAR datapoints with the same image corruption (Hendrycks & Dietterich, 2019); see Figure 1 for more examples of source-target dataset pairs that we consider." Also: "Datasets. We run experiments on nine real-world distribution shifts, categorized into input-level, feature-level, output-level, and natural shifts, with examples shown in Figure 1. For more details about these datasets, see Appendix B.3."
Dataset Splits | Yes | "In all experiments, we perform early stopping on held-out target data according to the fine-tuning loss." "We choose the best hyperparameters and early stop based on accuracy on held-out target data." Also, from Appendix B.4: "For all datasets and experiments, we early stop according to the best accuracy on a held-out validation subset of the labeled target data."
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU or GPU models, or cloud instance types) used for running its experiments.
Software Dependencies | No | The paper mentions optimizers (Adam, SGD, AdamW) but does not provide specific software library names with version numbers (e.g., PyTorch 1.9, TensorFlow 2.x) that would allow the software environment to be reproduced.
Experiment Setup | Yes | "We fine-tune with the Adam optimizer, sweeping over 3 learning rates. We choose the best hyperparameters and early stop based on accuracy on held-out target data. We report results across 3 seeds for all experiments. See Appendix B.4 for more fine-tuning details." Appendix B.4 provides specific learning rate ranges, weight decay values, and numbers of epochs for the various datasets, e.g., "We tune over the 3 learning rates {1e-3, 1e-4, 1e-5} for all methods except last-layer fine-tuning, where we tune over {1e-1, 1e-2, 1e-3}, and we use a weight decay of 0.0001 for all methods.", "We fine-tune on the labeled target data for 15 total epochs.", and "We use a cosine annealing learning rate scheduler (Loshchilov & Hutter, 2017) and batch size 32." (A sketch combining these settings appears after the table.)
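
To make the surgical fine-tuning idea concrete, below is a minimal PyTorch sketch of freezing every parameter except one contiguous block of a pretrained backbone. The choice of torchvision's ResNet-18, the block granularity (named children of the model), and the `surgical_finetune_params` helper are assumptions for illustration, not the authors' released code.

```python
import torch
import torchvision

# Minimal sketch (assumption, not the authors' code): "surgical" fine-tuning
# freezes all parameters except those in one contiguous block of the network.
# Here the named children of a torchvision ResNet-18 (conv1, bn1, layer1..layer4, fc)
# serve as the candidate blocks.

def surgical_finetune_params(model, block_name):
    """Freeze all parameters, then unfreeze the named block and return its params."""
    for p in model.parameters():
        p.requires_grad = False
    block = dict(model.named_children())[block_name]
    tunable = list(block.parameters())
    for p in tunable:
        p.requires_grad = True
    return tunable

model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
# Example: tune only the first residual block, the kind of choice the paper
# reports helping on input-level shifts such as CIFAR-10-C corruptions.
params = surgical_finetune_params(model, "layer1")
optimizer = torch.optim.Adam(params, lr=1e-4, weight_decay=1e-4)
```

Repeating this with a different `block_name` (e.g. "layer4" or "fc") and comparing held-out target accuracy mirrors the paper's observation that the best block depends on the type of distribution shift.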
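The Experiment Setup row can likewise be read as a small training loop. The sketch below is an assumption about how the reported settings fit together, with hypothetical `train_loader`/`val_loader` DataLoaders and an `evaluate` helper: Adam with one of the swept learning rates, weight decay 0.0001, cosine annealing, batch size 32, 15 epochs, and early stopping on held-out target accuracy.

```python
import copy
import torch
from torch.optim.lr_scheduler import CosineAnnealingLR

# Sketch of the reported fine-tuning recipe (not the authors' released script).
# Assumes `model`, `params` (the tunable parameter subset), and DataLoaders with
# batch_size=32 over labeled target data (`train_loader`, `val_loader`) exist.

def finetune(model, params, train_loader, val_loader, lr, epochs=15):
    optimizer = torch.optim.Adam(params, lr=lr, weight_decay=1e-4)
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    criterion = torch.nn.CrossEntropyLoss()
    best_acc = 0.0
    best_state = copy.deepcopy(model.state_dict())
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
        scheduler.step()
        acc = evaluate(model, val_loader)  # hypothetical helper: accuracy on held-out target data
        if acc > best_acc:                 # early stopping: keep the best checkpoint seen so far
            best_acc = acc
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)
    return best_acc

# Reported sweep: learning rates {1e-3, 1e-4, 1e-5} (or {1e-1, 1e-2, 1e-3} for
# last-layer fine-tuning). Each sweep run would start from a fresh copy of the
# pretrained model, and the best learning rate is chosen by held-out target accuracy.
```

The early-stopping criterion here follows the quoted Appendix B.4 text (best accuracy on a held-out validation subset of the labeled target data); the checkpoint-restoring mechanics are an assumption.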