Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
DORA: Exploring Outlier Representations in Deep Neural Networks
Authors: Kirill Bykov, Mayukh Deb, Dennis Grinwald, Klaus-Robert Müller, Marina M.-C. Höhne
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate the EA metric quantitatively, demonstrating its effectiveness both in controlled scenarios and real-world applications. Lastly, through practical experiments conducted on popular Computer Vision models, we reveal that anomalous representations identified by our framework often correspond to undesirable spurious concepts. To quantitatively evaluate the alignment, we compared human-defined semantic distances between concepts, which we refer to as semantic baselines, with distance matrices computed between representations trained to learn these concepts. |
| Researcher Affiliation | Collaboration | Klaus-Robert Müller — Machine Learning Group, Technical University of Berlin, Berlin, Germany; BIFOLD Berlin Institute for the Foundations of Learning and Data, Berlin, Germany; Department of Artificial Intelligence, Korea University, Seoul 136-713, Korea; Max Planck Institut für Informatik, 66123 Saarbrücken, Germany; Google Research, Brain Team, Berlin, Germany |
| Pseudocode | No | The paper includes mathematical definitions and formulas but does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | PyTorch implementation of the proposed method can be found by the following link: https://github.com/lapalap/dora. |
| Open Datasets | Yes | For our study, we utilized two prevalent computer vision datasets, namely ILSVRC-2012 Russakovsky et al. (2015) and CIFAR-100 Krizhevsky (2009). The combined dataset comprised the Tiny ImageNet Le and Yang (2015), containing 200 ImageNet classes, and the MNIST handwritten-numbers dataset Deng (2012), containing 10 handwritten numbers, resulting in a total of 210 classes. |
| Dataset Splits | Yes | The data set itself consists of 224,316 training, 200 validation, and 500 test data points. For our empirical analysis, we utilized a pre-trained ResNet18 model on ImageNet, along with the ILSVRC-2012 validation set consisting of 50,000 images and 1,000 classes, employed for the data-aware metrics. |
| Hardware Specification | No | All described experiments, if not stated otherwise, were performed on the Google Colab Pro Bisong and Bisong (2019) environment with the GPU accelerator. This statement is too general and does not specify exact GPU models, CPU models, or memory details. |
| Software Dependencies | No | The paper mentions software components like "PyTorch implementation", "NLTK package", "Torchvision library", "pytorch-vision-models library", "Pytorch-cifar100 GitHub repository", and "Lucent library" but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We computed functional distances with optimal hyperparameters found in Section 5.1, including Minkowski p = 1, Pearson, Spearman, EAn with n = 50, d = 200, and EAs with n = 3, m = 500, on the output logit layer for each model. |
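The Experiment Setup row quotes pairwise functional distances (Minkowski with p = 1, Pearson, Spearman) computed between representations on the output logit layer. Below is a minimal sketch of that kind of distance-matrix computation; it is not the paper's DORA implementation (https://github.com/lapalap/dora), and the array shape and the conversion of correlations to distances via 1 − r are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import minkowski
from scipy.stats import pearsonr, spearmanr

def distance_matrices(acts):
    """Compute pairwise distance matrices between representation activations.

    acts: array of shape (n_units, n_samples) — each row holds one output
    unit's responses over a probe set (assumed layout, for illustration).
    Returns Minkowski (p=1), Pearson, and Spearman distance matrices.
    """
    n = acts.shape[0]
    d_mink = np.zeros((n, n))
    d_pear = np.zeros((n, n))
    d_spear = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            # Minkowski distance with p = 1 (L1 / Manhattan distance).
            d_mink[i, j] = minkowski(acts[i], acts[j], p=1)
            # Correlations mapped to distances as 1 - r (an assumption here).
            d_pear[i, j] = 1.0 - pearsonr(acts[i], acts[j])[0]
            d_spear[i, j] = 1.0 - spearmanr(acts[i], acts[j])[0]
    return d_mink, d_pear, d_spear
```

Each returned matrix is symmetric with a zero diagonal, so any of them can be fed to downstream clustering or outlier detection over representations.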