Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Revisiting Hidden Representations in Transfer Learning for Medical Imaging

Authors: Dovile Juodelyte, Amelia Jiménez-Sánchez, Veronika Cheplygina

TMLR 2023

Reproducibility Variable Result LLM Response
Research Type Experimental In our experiments, ResNet50 models pre-trained on ImageNet tend to outperform those trained on RadImageNet. To gain further insights, we investigate the learned representations using Canonical Correlation Analysis (CCA) and compare the predictions of the different models. Our results indicate that, contrary to intuition, ImageNet and RadImageNet may converge to distinct intermediate representations, which appear to diverge further during fine-tuning. Despite these distinct representations, the predictions of the models remain similar. Our findings show that the similarity between networks before and after fine-tuning does not correlate with performance gains, suggesting that the advantages of transfer learning might not solely originate from the reuse of features in the early layers of a convolutional neural network.
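The CCA comparison quoted above can be sketched numerically. The helper below is an illustrative implementation (not the authors' code): it computes the mean canonical correlation between two activation matrices from the same inputs, using QR decomposition followed by SVD, which is a standard way to obtain canonical correlations.

```python
import numpy as np

def cca_similarity(X, Y):
    """Mean canonical correlation between two activation matrices.

    X, Y: (n_samples, n_features) activations produced by two networks
    on the same inputs; the feature dimensions may differ.
    """
    # Center each feature.
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of X and Y.
    Qx, _ = np.linalg.qr(X)
    Qy, _ = np.linalg.qr(Y)
    # Singular values of Qx^T Qy are the canonical correlations.
    corrs = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return float(corrs.mean())
```

Identical (or linearly related) representations score near 1, while unrelated representations score near 0, which is the sense in which the paper reports that ImageNet- and RadImageNet-pretrained layers diverge.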
Researcher Affiliation Academia Dovile Juodelyte (EMAIL), IT University of Copenhagen, Denmark; Amelia Jiménez-Sánchez (EMAIL), IT University of Copenhagen, Denmark; Veronika Cheplygina (EMAIL), IT University of Copenhagen, Denmark
Pseudocode No The paper describes methods like Canonical Correlation Analysis (CCA) and prediction similarity using mathematical formulas but does not include structured pseudocode or algorithm blocks.
Open Source Code Yes We make our code and experiments publicly available on GitHub: https://github.com/DovileDo/revisiting-transfer
Open Datasets Yes Source. We use publicly available pre-trained ImageNet (Deng et al., 2009) and RadImageNet (Mei et al., 2022) weights as source tasks in our experiments. Target. We investigate transferability to several medical target datasets. [...] 1) Chest. Chest X-rays (Kermany et al., 2018) dataset [...] 2) Breast. Breast ultrasound (Al-Dhabyani et al., 2020) dataset [...] 3) Thyroid. The Digital Database of Thyroid Ultrasound Images (DDTI) (Pedraza et al., 2015) [...] 4) Mammograms. Curated Breast Imaging Subset of Digital Database for Screening Mammography (CBIS-DDSM) (Sawyer-Lee et al., 2016; Lee et al., 2017; Clark et al., 2013) [...] 5) Knee. MRNet (Bien et al., 2018) [...] 6) PCam-small. Patch Camelyon (Veeling et al., 2018) [...] 7) ISIC. ISIC 2018 Challenge Task 3: Lesion Diagnosis (Codella et al., 2019; Tschandl et al., 2018)
Dataset Splits Yes We fine-tune the pre-trained networks on each target dataset using a five-fold cross-validation approach. Each dataset was split into training (80%), validation (5%), and test (15%) sets. To ensure patient-independent validation where patient information is available (chest, thyroid, mammograms, knee), the target data is split such that the same patient is only present in either the training, validation, or test split.
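A patient-independent split of this kind can be sketched in plain Python. The 80/5/15 fractions come from the quoted text; the function name and interface are illustrative, and fractions are approximate because whole patients are assigned to a single subset.

```python
import random

def patient_split(patient_ids, train=0.80, val=0.05, test=0.15, seed=0):
    """Split sample indices so each patient appears in exactly one subset.

    patient_ids: list mapping sample index -> patient identifier.
    Returns a dict with "train", "val", and "test" index lists.
    """
    # Group sample indices by patient.
    groups = {}
    for idx, pid in enumerate(patient_ids):
        groups.setdefault(pid, []).append(idx)
    # Shuffle patients, then partition patients (not samples) by fraction.
    pids = sorted(groups)
    random.Random(seed).shuffle(pids)
    n = len(pids)
    n_train = round(train * n)
    n_val = round(val * n)
    return {
        "train": [i for p in pids[:n_train] for i in groups[p]],
        "val":   [i for p in pids[n_train:n_train + n_val] for i in groups[p]],
        "test":  [i for p in pids[n_train + n_val:] for i in groups[p]],
    }
```

Splitting by patient rather than by sample is what prevents images of the same patient from leaking across the training and test sets.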
Hardware Specification Yes Models were implemented using Keras (Chollet et al., 2015) library and fine-tuned on 3 NVIDIA GeForce RTX 2070 GPU cards.
Software Dependencies No The paper mentions "Keras (Chollet et al., 2015) library" but does not specify a version number for Keras or any other software dependency.
Experiment Setup Yes We select ResNet50 (He et al., 2016) as the standard model architecture for our experiments. [...] We fine-tuned pre-trained networks using an average pooling layer and a dropout layer with a probability of 0.5. [...] we decided to fix the initial learning rate to a small value (1e-5) for all experiments, and used the Adam optimizer to adapt to each dataset. The models were trained for a maximum of 200 epochs, with early stopping after 30 epochs of no decrease in validation loss, saving the models that achieved the lowest validation loss. [...] As per the approach in Mei et al. (2022), we normalized the images with respect to the ImageNet dataset. To increase the diversity and variability of the training data, images were augmented during fine-tuning with the following parameters: rotation range of 10 degrees, width shift range of 0.1, height shift range of 0.1, shear range of 0.1, zoom range of 0.1, fill mode set to nearest, and horizontal flip set to false if the target is chest, otherwise set to true. Table 1 provides details of the image sizes and number of images used for fine-tuning on each target dataset.
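The augmentation parameters above correspond one-to-one with Keras `ImageDataGenerator` keyword arguments. A minimal config sketch, collecting the reported hyperparameters into one place (the function name is illustrative; only the values themselves come from the quoted text):

```python
def finetune_config(target):
    """Reported fine-tuning hyperparameters, keyed by target dataset.

    `horizontal_flip` is the one target-dependent setting: disabled
    for the chest X-ray dataset, enabled otherwise.
    """
    return {
        "architecture": "ResNet50",
        "dropout": 0.5,
        "learning_rate": 1e-5,
        "optimizer": "adam",
        "max_epochs": 200,
        "early_stopping_patience": 30,
        # These keys match ImageDataGenerator keyword arguments.
        "augmentation": {
            "rotation_range": 10,
            "width_shift_range": 0.1,
            "height_shift_range": 0.1,
            "shear_range": 0.1,
            "zoom_range": 0.1,
            "fill_mode": "nearest",
            "horizontal_flip": target != "chest",
        },
    }
```

Disabling horizontal flips for chest X-rays is a common choice, since flipping would mirror anatomically lateralized structures such as the heart.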