Inducing Causal Structure for Interpretable Neural Networks

Authors: Atticus Geiger, Zhengxuan Wu, Hanson Lu, Josh Rozner, Elisa Kreiss, Thomas Icard, Noah Goodman, Christopher Potts

ICML 2022

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We evaluate IIT on a structural vision task (MNIST-PVR), a navigational language task (ReaSCAN), and a natural language inference task (MQNLI). We compare IIT against multi-task training objectives and data augmentation. In all our experiments, IIT achieves the best results and produces neural models that are more interpretable in the sense that they more successfully realize the target causal model." |
| Researcher Affiliation | Academia | "Stanford University, Stanford, California. Correspondence to: Atticus Geiger <atticusg@stanford.edu>, Zhengxuan Wu <wuzhengx@stanford.edu>." |
| Pseudocode | Yes | The paper provides pseudocode for interchange intervention training (a hedged code sketch follows this table). |
| Open Source Code | Yes | "We release our code at https://github.com/frankaging/Interchange-Intervention-Training." |
| Open Datasets | Yes | "Our first benchmark is MNIST Pointer-Value Retrieval (MNIST-PVR; Zhang et al. 2021), a visual reasoning task constructed using the MNIST dataset (LeCun et al., 2010)." and "Our second benchmark is ReaSCAN (Wu et al., 2021), a synthetic command-based navigation task that builds off the SCAN (Lake & Baroni, 2018) and gSCAN (Ruis et al., 2020) benchmarks." and "Our final benchmark is MQNLI (Geiger et al., 2019), a synthetic natural language inference dataset." |
| Dataset Splits | Yes | "The train/test split designed by Zhang et al. (2021) creates a distributional shift between the training and testing data" and "The best model is picked by performance on a smaller development set of 2,000 examples, which is consistent with the training pipeline proposed in Ruis et al. (2020) for gSCAN." and "For our experiments, we used a train set with 500K examples, a dev set with 60K examples, and a test set with 10K examples." |
| Hardware Specification | Yes | "The training time is about 1 day on a standard GeForce RTX 2080 Ti GPU with 11GB memory." |
| Software Dependencies | No | The paper mentions software components such as PyTorch vision, the Adam optimizer, BERT, and the antra package, but does not provide specific version numbers for these dependencies (e.g., "PyTorch 1.9" or "TensorFlow 2.x"). |
| Experiment Setup | Yes | "The learning rate starts at 1e-4 and decays by 0.9 every 20,000 steps. We train the model for a fixed number of epochs (100,000) before stopping." and "We use a batch size of 32. We use 5.0 × 10⁻⁵ as our learning rate, and use AdamW optimization. We train for a maximum of 5 epochs." (hedged sketches of both setups follow this table) |
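
The Pseudocode row refers to the paper's interchange intervention training (IIT) objective: run the network on a base input while swapping in activations from a source-input run at the representations aligned with a high-level causal variable, then train the intervened output to match the label the causal model gives under the corresponding high-level intervention. Below is a minimal, hypothetical PyTorch sketch of that idea for a toy MLP with a single aligned variable. The class, the `align_slice` alignment, and the equal weighting of the two loss terms are illustrative assumptions, not the authors' released implementation (see the repository linked above for that).

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """Toy two-layer network; the hidden layer hosts the aligned variable."""
    def __init__(self, d_in=4, d_hidden=16, d_out=2):
        super().__init__()
        self.layer1 = nn.Linear(d_in, d_hidden)
        self.layer2 = nn.Linear(d_hidden, d_out)

    def forward(self, x, source_x=None, align_slice=None):
        h = torch.relu(self.layer1(x))
        if source_x is not None:
            # Interchange intervention: overwrite the aligned neurons of the
            # base run with the activations from the source run.
            h_src = torch.relu(self.layer1(source_x))
            h = h.clone()
            h[:, align_slice] = h_src[:, align_slice]
        return self.layer2(h)

def iit_step(model, optimizer, base_x, base_y, src_x, counterfactual_y,
             align_slice):
    # Standard task loss on the base input, plus an IIT loss: the intervened
    # network must match the counterfactual label produced by the high-level
    # causal model under the same intervention. Equal weighting is assumed.
    ce = nn.CrossEntropyLoss()
    task_loss = ce(model(base_x), base_y)
    iit_loss = ce(model(base_x, source_x=src_x, align_slice=align_slice),
                  counterfactual_y)
    loss = task_loss + iit_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage with random data; in practice `counterfactual_y` would come
# from running the high-level causal model, not from a placeholder.
model = MLP()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
base_x, src_x = torch.randn(32, 4), torch.randn(32, 4)
base_y = torch.randint(0, 2, (32,))
counterfactual_y = torch.randint(0, 2, (32,))  # placeholder labels
loss = iit_step(model, opt, base_x, base_y, src_x, counterfactual_y,
                align_slice=slice(0, 8))
```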
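For the Experiment Setup row, the two quoted regimes map onto standard PyTorch optimizer configurations. The sketch below is one plausible reading: the placeholder `model`, the choice of `Adam` for the first regime, and the use of `StepLR` as the decay scheduler are assumptions, since the excerpts state only the rates, decay factor, and step interval.

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder for the actual task model

# First quoted setup: lr starts at 1e-4 and decays by 0.9 every 20,000 steps.
# Optimizer choice (Adam) is assumed; call scheduler.step() once per
# optimization step so the decay interval is counted in steps, not epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20_000,
                                            gamma=0.9)

# Second quoted setup: batch size 32, lr 5.0e-5, AdamW, at most 5 epochs.
bert_optimizer = torch.optim.AdamW(model.parameters(), lr=5.0e-5)
```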