Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Fine-tuning can cripple your foundation model; preserving features may be the solution
Authors: Jishnu Mukhoti, Yarin Gal, Philip Torr, Puneet K. Dokania
TMLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments on 10 fine-tuning tasks we show that LDIFS significantly reduces concept forgetting. Additionally, we show that LDIFS is highly effective in performing continual fine-tuning on a sequence of tasks as well, in comparison with both fine-tuning as well as continual learning baselines. |
| Researcher Affiliation | Collaboration | Jishnu Mukhoti EMAIL Department of Engineering Science, University of Oxford Yarin Gal EMAIL Department of Computer Science, University of Oxford Philip H.S. Torr EMAIL Department of Engineering Science, University of Oxford Puneet K. Dokania EMAIL Department of Engineering Science, University of Oxford, Five AI |
| Pseudocode | No | The paper describes methods using mathematical formulations (e.g., equations for L2SP and LDIFS loss functions) and textual descriptions of steps, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper mentions using a pre-trained CLIP model from the Open CLIP repository (Ilharco et al. (2021)) and refers to 'Following the code of Ilharco et al. (2022a)' for training details. However, it does not provide any explicit statement or link for the open-sourcing of the authors' own implementation code for the proposed LDIFS method. |
| Open Datasets | Yes | To quantify concept forgetting, here we use CLIP Radford et al. (2021) ViT-B/32 pre-trained on the OpenAI dataset and released in the OpenCLIP repository Ilharco et al. (2021) and measure its LP performance over fine-tuning on 10 different image classification downstream tasks with a high variability in their semantic concepts. These datasets, along with their respective train/test splits are: 1. Stanford Cars Krause et al. (2013) 2. CIFAR-10/100 (C10/100) Krizhevsky et al. (2009) 3. DTD Cimpoi et al. (2014) 4. EuroSAT Helber et al. (2019) 5. GTSRB Stallkamp et al. (2012) 6. MNIST LeCun et al. (1998) 7. RESISC45 (R45) Cheng et al. (2017) 8. SVHN Netzer et al. (2011) 9. ImageNet Deng et al. (2009) |
| Dataset Splits | Yes | 1. Stanford Cars Krause et al. (2013) containing 16,185 images of 196 classes of cars with a train/test split of 8,144 and 8,041 images respectively, 2. CIFAR-10/100 (C10/100) Krizhevsky et al. (2009) containing 60,000 images of vehicles, flora and fauna, divided into 10/100 classes with the train/test split having 50,000 and 10,000 images respectively, 3. DTD Cimpoi et al. (2014) containing 3,760 images of 47 classes of textures found in the wild with 1,880 images each in the train and test sets, 4. EuroSAT Helber et al. (2019) containing 25,000 samples with 10 categories of satellite images of landscapes and 19,600/5,400 training/test images respectively, 5. GTSRB Stallkamp et al. (2012) containing 39,270 images of 43 classes of German traffic signs with 26,640 training images and 12,630 test images, 6. MNIST LeCun et al. (1998) containing 60,000 training images and 10,000 test images of 10 handwritten digits from 0 to 9 in grayscale, 7. RESISC45 (R45) Cheng et al. (2017) containing 25,200 samples with 45 classes of various remote sensing image scenes with the train/test split having 18,900 and 6,300 images respectively, 8. SVHN Netzer et al. (2011) containing a total of 99,289 colour images of street view house numbers, each image being categorized into one of 10 digits with 73,257 training samples and 26,032 test samples. 9. ImageNet Deng et al. (2009) containing a total of 1.28 million training images and 50,000 validation images of 1000 classes. |
| Hardware Specification | Yes | Each of our fine-tuning runs is done on a single NVIDIA A100 GPU. |
| Software Dependencies | No | We train using the AdamW Loshchilov & Hutter (2017) optimizer, with an initial learning rate of 1e-5, a weight decay of 0.1 and a cosine learning rate scheduler with a warmup length of 500. For all the runs, we use a batch-size of 128. Following the code of Ilharco et al. (2022a), we use the following number of epochs to fine-tune each dataset... (further details on epochs and lambda regularization) The paper mentions the 'AdamW' optimizer and 'scikit-learn's Logistic Regression module Pedregosa et al. (2011)', but does not provide specific version numbers for these software components or any other libraries/frameworks used for the implementation. |
| Experiment Setup | Yes | We train using the AdamW Loshchilov & Hutter (2017) optimizer, with an initial learning rate of 1e-5, a weight decay of 0.1 and a cosine learning rate scheduler with a warmup length of 500. For all the runs, we use a batch-size of 128. Following the code of Ilharco et al. (2022a), we use the following number of epochs to fine-tune each dataset: a) Stanford Cars: 35 epochs, b) CIFAR-10/100: 10 epochs, c) DTD: 76 epochs, d) EuroSAT: 12 epochs, e) GTSRB: 11 epochs, f) MNIST: 10 epochs, g) RESISC45: 15 epochs, h) SVHN: 10 epochs and i) ImageNet: 10 epochs. We keep a minimum of 10 epochs for fine-tuning. Choosing λLDIFS: One hyper-parameter which the LDIFS regularizer introduces is λLDIFS. A higher value of λLDIFS encourages the model to preserve features of the original foundation model and vice versa. For each classification task, we performed a grid search over λLDIFS ∈ {0.01, 0.05, 0.1, 0.5, 1, 10, 100} and cross-validated this hyper-parameter on a held-out validation set, choosing the value which produces the best performance on the validation set. We found λLDIFS = 10 to produce the best performance over datasets in general, so all the results we present in this paper are with λLDIFS set to 10. |
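The paper provides no pseudocode or released implementation, but it describes LDIFS as a regularizer that penalizes the distance in feature space between the fine-tuned model and the frozen foundation model, weighted by λLDIFS (with λ = 10 chosen via grid search). The sketch below illustrates that idea only: the choice of layers, the plain ℓ2 distance, and the function names are assumptions for illustration, not the authors' exact loss definition.

```python
import math

def ldifs_penalty(feats_ft, feats_pre):
    """Average L2 distance between features of the fine-tuned model
    (feats_ft) and the frozen pre-trained model (feats_pre), one
    feature vector per checkpointed layer. The layer set and the lack
    of normalisation are illustrative assumptions, not the paper's
    exact definition."""
    assert len(feats_ft) == len(feats_pre)
    dists = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))
        for f, g in zip(feats_ft, feats_pre)
    ]
    return sum(dists) / len(dists)

def total_loss(task_loss, feats_ft, feats_pre, lam=10.0):
    # lam defaults to 10, the value the paper's grid search over
    # {0.01, 0.05, 0.1, 0.5, 1, 10, 100} found best overall.
    return task_loss + lam * ldifs_penalty(feats_ft, feats_pre)
```

For example, a single layer whose features drift from (0, 0) to (3, 4) incurs a penalty of 5.0, so `total_loss(1.0, [[3.0, 4.0]], [[0.0, 0.0]])` returns 51.0: the further fine-tuning pulls features from the foundation model, the larger the penalty, which is exactly the "feature preservation" pressure the paper attributes to LDIFS.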