Removing Spurious Concepts from Neural Network Representations via Joint Subspace Estimation

Authors: Floris Holstege, Bram Wouters, Noud Van Giersbergen, Cees Diks

ICML 2024

Reproducibility Variable Result LLM Response
Research Type Experimental By evaluating the algorithm on benchmark datasets from computer vision (Waterbirds, CelebA) and natural language processing (MultiNLI), we show it outperforms existing concept-removal methods in terms of identifying the main-task and spurious concepts, while removing only the latter.
Researcher Affiliation Academia University of Amsterdam, Department of Quantitative Economics; Tinbergen Institute. Correspondence to: Floris Holstege <f.g.holstege@uva.nl>.
Pseudocode Yes Algorithm 1: JSE algorithm to estimate orthonormal bases for Z_sp and Z_mt. The conditions in the if-statements are discussed in Section 3.3.
Input: a sample {y_mt,k, y_sp,k, z_k}_{k=1}^n consisting of two binary labels and a vector z_k in R^d.
Initialize embedding matrix Z = (z_1 z_2 ... z_n)^T. Initialize Z_sp <- Z.
for i = 1, ..., d do
  Z_remain <- Z_sp
  for j = 1, ..., d do
    Estimate w^_sp, w^_mt with Equation 2 (use Z_remain).
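The nested loop above can be sketched in Python. This is a hypothetical illustration only: `estimate_w` stands in for the paper's Equation 2 (here replaced by a plain least-squares fit), and the deflation step projects the estimated spurious direction out of the embeddings.

```python
import numpy as np

def estimate_w(Z, y):
    # Placeholder for Equation 2: a least-squares direction predicting y
    # from the embeddings Z (the paper jointly estimates w_sp and w_mt;
    # this single fit is for illustration only).
    w, *_ = np.linalg.lstsq(Z, y, rcond=None)
    norm = np.linalg.norm(w)
    return w / norm if norm > 0 else w

def jse_sketch(Z, y_sp, d_sp=1):
    """Hypothetical sketch of the JSE outer loop: repeatedly estimate a
    spurious-concept direction, then deflate the embeddings by projecting
    that direction out before the next iteration."""
    Z_remain = Z.copy()
    basis = []
    for _ in range(d_sp):
        w_sp = estimate_w(Z_remain, y_sp)
        basis.append(w_sp)
        # Remove the estimated spurious direction from every embedding.
        Z_remain = Z_remain - np.outer(Z_remain @ w_sp, w_sp)
    return np.stack(basis), Z_remain
```

After the loop, `Z_remain` contains no component along the estimated spurious directions, which mirrors the idea of removing only the spurious concept.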
Open Source Code Yes Our code with an implementation of JSE is publicly available.* *https://github.com/fholstege/JSE
Open Datasets Yes Waterbirds: this dataset from Sagawa et al. (2020b) is a combination of the Places dataset (Zhou et al., 2016) and the CUB dataset (Welinder et al., 2010)... CelebA: this dataset contains images of celebrity faces (Liu et al., 2015)... MultiNLI: the MultiNLI dataset (Williams et al., 2018)...
Dataset Splits Yes For a given dataset size (e.g., n = 2,000) the data is split into an 80% training and 20% validation set, and a test set of the same size is kept apart for evaluation.
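The split described (80% train, 20% validation from a sample of size n) could be implemented as a minimal sketch; the function name and seed handling are illustrative, not from the paper.

```python
import numpy as np

def make_splits(n, seed=0):
    """Shuffle n indices and split 80% train / 20% validation.
    (A held-out test set of the same size n would be drawn separately,
    as the paper describes.)"""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(0.8 * n)
    return idx[:n_train], idx[n_train:]
```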
Hardware Specification No The paper specifies the neural network architectures (ResNet50, BERT) and software libraries (torchvision, transformers) used, but does not provide any concrete details about the specific hardware (e.g., GPU models, CPU types) on which experiments were run.
Software Dependencies No The paper mentions the 'torchvision package', 'transformers package', 'PyTorch', and 'Adam optimizer' but does not specify exact version numbers for these software dependencies, which is required for reproducibility.
Experiment Setup Yes For Waterbirds, this means using a learning rate of 10^-3, a weight decay of 10^-3, a batch size of 32, and training for 100 epochs without early stopping. For CelebA, this means using a learning rate of 10^-3, a weight decay of 10^-4, a batch size of 128, and training for 50 epochs without early stopping.
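Collected as a plain config dict (key names are illustrative, values are the reported settings), the reported hyperparameters read:

```python
# Reported training hyperparameters per dataset; dict layout and key
# names are an assumption for illustration, not from the paper.
HPARAMS = {
    "waterbirds": {
        "lr": 1e-3, "weight_decay": 1e-3,
        "batch_size": 32, "epochs": 100, "early_stopping": False,
    },
    "celeba": {
        "lr": 1e-3, "weight_decay": 1e-4,
        "batch_size": 128, "epochs": 50, "early_stopping": False,
    },
}
```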