Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Causal Differentiating Concepts: Interpreting LM Behavior via Causal Representation Learning

Authors: Navita Goyal, Hal Daumé III, Alexandre Drouin, Dhanya Sridhar

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the ability of our method to disentangle mediating concepts in settings where we have some domain knowledge about what the desired mediating concepts are. We compare our method to baselines without disentanglement guarantees and sparse autoencoders (SAEs), and find that our method outperforms these related methods. ... Data. We conduct our experiments in three settings: (1) synthetic data, (2) semi-synthetic data with real text and synthetic labels, and (3) non-synthetic data with text and LM outputs. ... Evaluation metrics. We evaluate the effectiveness of our method at recovering the ground-truth causal factors using the disentanglement-completeness-informativeness (DCI) metrics (Eastwood and Williams, 2018). ... Table 1 shows a comparison between our method and the baseline methods for synthetic data. ... Table 2 shows results on the semi-synthetic data.
Researcher Affiliation | Collaboration | Navita Goyal (University of Maryland); Hal Daumé III (University of Maryland); Alexandre Drouin (ServiceNow Research, Mila-Quebec AI Institute); Dhanya Sridhar (Mila-Quebec AI Institute, Université de Montréal)
Pseudocode | No | The paper describes the methodology in prose (Section 3) and provides proofs in Appendix A, but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We provide the data and code, with instructions to reproduce the experimental results in the supplementary material.
Open Datasets | Yes | For semi-synthetic data, we sample 20,000 bios from Bias Bios dataset (De-Arteaga et al., 2019), which is available under Apache-2.0 license. ... Harmful examples are sampled uniformly from the MALICIOUSINSTRUCT (Huang et al. (2024), CC BY-SA-4.0 License), HARMBENCH (Mazeika et al. (2024), MIT License), ADVBENCH (Zou et al. (2023), MIT License), and TDC2023 (Mazeika et al. (2022), MIT License) datasets. Pseudo-harmful examples are sampled from OR-BENCH-80K (Cui et al. (2025), CC BY-4.0 License).
Dataset Splits | Yes | All datasets follow a 70:15:15 train-validation-test split.
Hardware Specification | Yes | The experiments in this paper were conducted on machines equipped with Tesla P100-PCIE-12GB GPUs.
Software Dependencies | No | We use PyTorch and Hugging Face Transformers libraries for our experiments. For experiments with synthetic data, we train our models using the Adam optimizer and a learning rate scheduler that reduces the learning rate when the validation loss plateaus. The model is trained for 50 epochs and the best checkpoint is selected based on the validation loss. For semi-synthetic and non-synthetic experiments, we use the default optimizer and scheduler provided in the Transformer training utils (AdamW and a linear learning rate scheduler). The model is trained for 3 epochs.
Experiment Setup | Yes | For experiments with synthetic data, we train our models using the Adam optimizer and a learning rate scheduler that reduces the learning rate when the validation loss plateaus. The model is trained for 50 epochs and the best checkpoint is selected based on the validation loss. For semi-synthetic and non-synthetic experiments, we use the default optimizer and scheduler provided in the Transformer training utils (AdamW and a linear learning rate scheduler). The model is trained for 3 epochs. In the semi-synthetic setting, the number and size of layers in the bottleneck modules are treated as hyperparameters (with nlayers ∈ [2, 4, 8, 16] and hd ∈ [64, 128, 256, 512]), with the hidden dimension of the final layers fixed to d. Hyperparameter selection is performed with grid search using the Ray Tune library, optimizing for disentanglement score on the validation dataset.
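As an illustration of the Dataset Splits entry, the following is a minimal sketch of a 70:15:15 train-validation-test split applied to 20,000 items, the Bias Bios sample size quoted above. The function name `split_70_15_15`, the shuffling step, and the seed are assumptions made for this example; the quoted text does not say how the authors shuffle or seed their splits.

```python
import random


def split_70_15_15(items, seed=0):
    """Shuffle items and split them 70:15:15 into train/validation/test.

    The seed and the shuffle are illustrative assumptions, not details
    taken from the paper.
    """
    items = list(items)
    random.Random(seed).shuffle(items)

    n = len(items)
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n_train = n * 70 // 100
    n_val = n * 15 // 100

    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_70_15_15(range(20000))
print(len(train), len(val), len(test))
```

With 20,000 items this gives 14,000 training, 3,000 validation, and 3,000 test examples. Any remainder from integer division is assigned to the test set.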
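To make the size of the hyperparameter search in the Experiment Setup entry concrete, the sketch below enumerates the quoted grid (four layer counts by four hidden sizes) and picks the configuration with the highest validation score. It uses a plain exhaustive loop rather than Ray Tune, which the paper reports using. The names `build_grid`, `select_best`, and `toy_score` are hypothetical, and `toy_score` is a placeholder for training a model and computing its validation disentanglement score.

```python
from itertools import product

# Search space quoted in the Experiment Setup entry.
N_LAYERS = [2, 4, 8, 16]
HIDDEN_DIMS = [64, 128, 256, 512]


def build_grid():
    """Return every (n_layers, hidden_dim) combination as a config dict."""
    return [
        {"n_layers": n, "hidden_dim": h}
        for n, h in product(N_LAYERS, HIDDEN_DIMS)
    ]


def select_best(configs, score_fn):
    """Return the config that maximizes score_fn (higher is better)."""
    return max(configs, key=score_fn)


def toy_score(config):
    """Placeholder score, NOT the paper's metric.

    A real run would train the bottleneck modules with this config and
    return the disentanglement score on the validation set.
    """
    return -abs(config["n_layers"] - 4) - abs(config["hidden_dim"] - 128) / 64


grid = build_grid()
best = select_best(grid, toy_score)
print(len(grid), best)
```

The grid contains 16 configurations, so an exhaustive search is feasible here. A tuning library mainly adds scheduling and parallel execution on top of this loop.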