Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

From Black-box to Causal-box: Towards Building More Interpretable Models

Authors: Inwoo Hwang, Yushu Pan, Elias Bareinboim

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments corroborate the theoretical findings. In this section, we evaluate our framework for estimating counterfactuals and compare it with prior approaches. Experimental details and additional experimental results are provided in Appendix B. 4.1 Synthetic datasets: We design the Bar MNIST dataset [17, 24] where the digits are colored and a bar appears at the top of the image, as shown in Fig. 5a. 4.2 Real-world datasets: CelebA dataset [19] contains human face images with the annotations on facial expressions and attributes, such as smiling, age, gender, etc.
Researcher Affiliation | Academia | Inwoo Hwang, Yushu Pan, Elias Bareinboim; Causal Artificial Intelligence Lab, Columbia University
Pseudocode | No | The paper includes mathematical definitions and theorems but does not contain any explicitly labeled pseudocode or algorithm blocks. It describes methodologies in prose and mathematical notation.
Open Source Code | No | For the implementation, we utilized publicly available code from Espinosa Zarlenga et al. [6]. (This indicates that the authors used someone else's code rather than releasing their own.) Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: We provide the experimental details in Appendix B.
Open Datasets | Yes | We design the Bar MNIST dataset [17, 24] where the digits are colored and a bar appears at the top of the image, as shown in Fig. 5a. CelebA dataset [19] contains human face images with the annotations on facial expressions and attributes, such as smiling, age, gender, etc.
Dataset Splits | No | For Bar MNIST, we generated 60,000 images and corresponding labels. For CelebA, it contains 202,599 celebrity facial images. However, the paper does not specify how these datasets were split into training, validation, and test sets (e.g., percentages or sample counts for each split).
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA A100 GPU.
Software Dependencies | No | In Bar MNIST, we used ResNet18 for the feature extractor. For the classifier, we used a three-layer MLP with the hidden dimension of 32 and LeakyReLU activation. We used Adam optimizer with a learning rate of 0.0003. In CelebA, we used ResNet34 for the feature extractor and used a linear classifier. We used SGD optimizer with the learning rate of 0.001. While specific models and optimizers are mentioned, no specific software library versions (e.g., PyTorch version, Python version) are provided.
Experiment Setup | Yes | We set the batch size to 1024 and trained the models for 100 epochs. We used Adam optimizer with a learning rate of 0.0003. In CelebA, we set the batch size to 512 and trained the models for 100 epochs. We used SGD optimizer with the learning rate of 0.001. We resized the image with center crop into 64 × 64 for training.
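The reported settings can be gathered into one small configuration record, which also makes the unreported items explicit. The following is a minimal sketch, not the authors' code: the class name `ExperimentConfig`, its field names, and the helper methods are hypothetical, and the values are copied from the table above. Two assumptions are labeled in the comments: that the 64 × 64 resize applies to CelebA (the sentence follows the CelebA settings), and that the step counts use the full dataset for training, since the split is not given.

```python
from dataclasses import dataclass, fields
from math import ceil
from typing import List, Optional


@dataclass(frozen=True)
class ExperimentConfig:
    """Hypothetical container for the hyperparameters listed in the table."""
    dataset: str
    num_images: int
    feature_extractor: str
    classifier: str
    optimizer: str
    learning_rate: float
    batch_size: int
    epochs: int
    # Fields below default to None when the summary does not report them.
    image_size: Optional[int] = None
    train_fraction: Optional[float] = None
    framework_version: Optional[str] = None

    def unreported(self) -> List[str]:
        """Names of settings that would have to be guessed to rerun the experiment."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def steps_per_epoch(self, train_fraction: float = 1.0) -> int:
        """Optimizer steps per epoch; train_fraction=1.0 is an assumption,
        because the train/validation/test split is not specified."""
        n_train = int(self.num_images * train_fraction)
        return ceil(n_train / self.batch_size)


BAR_MNIST = ExperimentConfig(
    dataset="Bar MNIST",
    num_images=60_000,
    feature_extractor="ResNet18",
    classifier="3-layer MLP, hidden dim 32, LeakyReLU",
    optimizer="Adam",
    learning_rate=3e-4,
    batch_size=1024,
    epochs=100,
)

CELEBA = ExperimentConfig(
    dataset="CelebA",
    num_images=202_599,
    feature_extractor="ResNet34",
    classifier="linear",
    optimizer="SGD",
    learning_rate=1e-3,
    batch_size=512,
    epochs=100,
    image_size=64,  # assumption: the 64 x 64 center crop refers to CelebA
)

if __name__ == "__main__":
    for cfg in (BAR_MNIST, CELEBA):
        print(cfg.dataset, "steps/epoch:", cfg.steps_per_epoch(),
              "unreported:", cfg.unreported())
```

Leaving the unreported fields as `None` rather than filling in plausible defaults keeps the record aligned with the "No" entries for Dataset Splits and Software Dependencies above.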