Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Objective drives the consistency of representational similarity across datasets
Authors: Laure Ciernik, Lorenz Linhardt, Marco Morik, Jonas Dippel, Simon Kornblith, Lukas Muttenthaler
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments investigate the consistency of representational similarities between vision models across various datasets using our framework. Following the experimental setup (4.1), we first analyze how model similarities transfer across datasets (4.2, 4.4) and find consistent model groups through similarity clustering (4.3). We then investigate how model and dataset characteristics influence similarity consistency (4.5 and 4.6), and conclude by examining the relationship between representational similarities and models' performance differences on downstream tasks (4.7). |
| Researcher Affiliation | Collaboration | 1Machine Learning Group, Technische Universität Berlin, Berlin, Germany 2Hector Fellow Academy, Karlsruhe, Germany 3European Laboratory for Learning and Intelligent Systems (ELLIS), Tübingen, Germany 4BIFOLD Berlin Institute for the Foundations of Learning and Data, Berlin, Germany 5Aignostics, Berlin, Germany 6Anthropic, California, United States of America. |
| Pseudocode | No | The paper describes methodologies using natural language and flowcharts (e.g., Fig. 1), but no formal pseudocode or algorithm blocks are explicitly present. |
| Open Source Code | Yes | The code and the data to run our analyses and reproduce the experimental results are publicly available at https://github.com/lciernik/similarity_consistency. |
| Open Datasets | Yes | Datasets. We evaluated pairwise model similarities across 20 datasets from the CLIP benchmark (Cherti & Beaumont, 2022) and 3 datasets from Breeds (Santurkar et al., 2021). This set includes various VTAB datasets as well as ImageNet-1k. Following the categorization proposed in (Zhai et al., 2020), we classified the datasets into three main types: natural (e.g., ImageNet-1k), specialized (e.g., PCAM), and structured (e.g., DTD) image datasets. Furthermore, we partition the natural image datasets into single- and multi-domain categories. A list of all datasets can be found in Tab. 1. |
| Dataset Splits | Yes | For each dataset, we selected the training split, potentially subsetted for large datasets as described in Appx. C, to compute the model representational similarities, while we use the validation or test split to compute downstream performance. Each model was evaluated with 3 different random seeds on each dataset, and the mean top-1 accuracy across seeds was used. Hyperparameter selection was performed with a validation set of 20% of the training set and the remaining training data for optimization. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using the Python package thingsvision (Muttenthaler & Hebart, 2021) and torchvision, but does not provide specific version numbers for these or other software dependencies like the AdamW optimizer. |
| Experiment Setup | Yes | Hyperparameter selection was performed with a validation set of 20% of the training set and the remaining training data for optimization. We followed the binary search procedure described in (Radford et al., 2021) and searched for the optimal weight decay parameter λ in the interval between 10⁻⁶ and 10² in 96 logarithmically spaced steps. This was done for all learning rates η ∈ {10⁻ⁱ}⁴ᵢ₌₁. After hyperparameter selection, the linear probe was retrained on the full training set and evaluated on the respective test set (validation set for ImageNet-1k). All linear probes are trained for 20 epochs, using the AdamW optimizer (Loshchilov & Hutter, 2019) and a cosine schedule for learning rate decay (Loshchilov & Hutter, 2017). |
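The evaluation protocol quoted in the Dataset Splits row can be sketched as follows. This is a minimal illustration, not the authors' code: the `holdout_split` helper and the accuracy values are hypothetical, standing in for the described 80/20 validation holdout and the mean top-1 accuracy over 3 random seeds.

```python
import random
import statistics

def holdout_split(indices, val_fraction=0.2, seed=0):
    """Hold out a fraction of training indices for hyperparameter selection.

    Illustrative helper (not from the paper): the paper reserves 20% of the
    training set for validation and optimizes on the remaining 80%.
    """
    rng = random.Random(seed)
    shuffled = indices[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)

train_idx, val_idx = holdout_split(list(range(1000)))
print(len(train_idx), len(val_idx))  # 800 200

# Each model is evaluated with 3 random seeds per dataset; the reported
# downstream metric is the mean top-1 accuracy across seeds.
seed_accuracies = [0.812, 0.808, 0.815]  # placeholder numbers, not paper results
mean_top1 = statistics.mean(seed_accuracies)
```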
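The hyperparameter grid described in the Experiment Setup row can be enumerated with a few lines of NumPy. This is a sketch of the grid only: the paper's actual search is the binary search procedure of Radford et al. (2021) over these values, which is not reproduced here.

```python
import numpy as np

# Weight decay λ: 96 logarithmically spaced values in [1e-6, 1e2],
# as stated in the paper's experiment setup.
weight_decays = np.logspace(-6, 2, num=96)

# Learning rates η ∈ {10^-1, 10^-2, 10^-3, 10^-4}.
learning_rates = [10.0 ** -i for i in range(1, 5)]

# Full candidate grid of (η, λ) pairs considered across the sweep.
grid = [(lr, wd) for lr in learning_rates for wd in weight_decays]
print(len(grid))  # 384
```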