Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Identifying biological perturbation targets through causal differential networks
Authors: Menghua Wu, Umesh Padia, Sean H. Murphy, Regina Barzilay, Tommi Jaakkola
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CDN on real transcriptomic data and synthetic settings. CDN outperforms the state-of-the-art in perturbation modeling (deep learning and statistical approaches), evaluated on the five largest Perturb-seq datasets at the time of publication (Replogle et al., 2022; Nadig et al., 2024) without using any external knowledge. Furthermore, CDN generalizes with minimal performance drop to unseen cell lines, which have different supports (genes), causal mechanisms (gene regulatory networks), and data distributions. On synthetic settings, CDN outperforms causal discovery approaches for estimating unknown intervention targets. |
| Researcher Affiliation | Academia | Department of Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA. Correspondence to: Menghua Wu <EMAIL>. |
| Pseudocode | No | The paper describes the model architecture in Section 3.1 and Appendix A, including equations and descriptions of layers. However, it does not present any structured pseudocode or algorithm blocks with numbered steps. |
| Open Source Code | Yes | Code is available at https://github.com/rmwu/cdn |
| Open Datasets | Yes | We validate CDN on five Perturb-seq (Dixit et al., 2016) datasets (genetic perturbations) from Replogle et al. (2022) and Nadig et al. (2024), as well as two Sci-Plex (Srivatsan et al., 2020) datasets (chemical perturbations) from Mc Faline-Figueroa et al. (2024). Each dataset is a real-valued matrix of gene expression levels: the number of examples M is the number of cells, the number of variables N is the number of genes, and each entry is a log-normalized count of how many copies of gene j were measured from cell i. Table 4 (extended biological dataset statistics, raw) reports type, source, accession, cell line, # perturbations, # genes, # NTCs, and # cells per dataset: genetic, Replogle et al. (2022), Figshare 20029387, K562 gw (9,866 perts; 8,248 genes; 75,328 NTCs; 1,989,578 cells); genetic, Nadig et al. (2024), GSE220095, Hep G2 (2,393 perts; 9,624 genes; 4,976 NTCs; 145,473 cells); chemical, Mc Faline-Figueroa et al. (2024), GSM7056151, A172 (23 perts; 8,393 genes; 8,660 NTCs; 58,347 cells). |
| Dataset Splits | Yes | We consider two splits: seen and unseen cell lines. In the former, models may be trained on approximately half of the perturbations from each cell line and are evaluated on the unseen perturbations. In the latter, we hold out one cell line at a time, and models may be trained on data from the remaining cell lines. To ensure that our train and test splits are sufficiently distinct, we cluster perturbations based on their log-fold change and assign each cluster to the same split (Figure 5). Table 5 (extended biological dataset statistics, processed), K562 gw row: perturbations: 1,089 train, 678 test, 587 trivial, 91 non-trivial; genes: 7,378 unique, median # DE 81; 492,096 cells. |
| Hardware Specification | Yes | During training, we used 15 CPU workers (primarily for local graph estimates) and 1 A6000 GPU. All models run on a single A6000 GPU, no constraint on memory (up to 500G). |
| Software Dependencies | No | The paper mentions using specific algorithms like 'AdamW optimizer (Loshchilov & Hutter, 2019)' and 'FCI algorithm (Spirtes et al., 1995)', and libraries like 'scikit-learn (Pedregosa et al., 2011)' and 'scanpy package (Wolf et al., 2018)'. While these tools are identified, no specific version numbers (e.g., PyTorch 1.9, Python 3.8) for the authors' core implementation are provided. The statement 'We used the latest releases of all baselines.' refers to third-party tools, not the authors' own software dependencies with specific version numbers. |
| Experiment Setup | Yes | We swept over the number of differential network layers (Figure 4) on synthetic data, and we used 3 layers for h_cat and 2 layers for h_diff. Following SEA, we adopted hidden dimension d = 64, the AdamW optimizer (Loshchilov & Hutter, 2019), learning rate 1e-4, batch size 16, and weight decay 1e-5. On the real data, where N = 1000, we changed to a batch size of 1, decreased the learning rate to 5e-6, and finetuned the models with half precision (FP16). |
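The cluster-level split described under "Dataset Splits" can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy cluster labels stand in for clusters of log-fold-change profiles, and the 200-perturbation / 20-cluster sizes are assumptions for demonstration.

```python
import random

# Toy stand-in: one cluster id per perturbation. In the paper, clusters come
# from grouping perturbations by their log-fold-change profiles.
random.seed(0)
labels = [random.randrange(20) for _ in range(200)]

# Assign each WHOLE cluster to train or test, so that similar perturbations
# never straddle the split.
cluster_ids = list(set(labels))
random.shuffle(cluster_ids)
train_clusters = set(cluster_ids[: len(cluster_ids) // 2])

train_idx = [i for i, c in enumerate(labels) if c in train_clusters]
test_idx = [i for i, c in enumerate(labels) if c not in train_clusters]
```

The key property is that membership is decided per cluster, not per perturbation, which keeps the train and test distributions distinct.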
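The reported hyperparameters can be collected into a plain configuration sketch. The dictionary names and structure are illustrative assumptions; the values are taken directly from the quoted setup (synthetic defaults, with the stated changes for real data).

```python
# Synthetic-data defaults, as quoted in the Experiment Setup cell.
SYNTHETIC_CONFIG = {
    "hidden_dim": 64,        # d = 64
    "optimizer": "AdamW",    # Loshchilov & Hutter, 2019
    "lr": 1e-4,
    "batch_size": 16,
    "weight_decay": 1e-5,
    "layers_h_cat": 3,
    "layers_h_diff": 2,
}

# Real-data overrides: batch size 1, lr 5e-6, half-precision finetuning.
REAL_CONFIG = {
    **SYNTHETIC_CONFIG,
    "batch_size": 1,
    "lr": 5e-6,
    "precision": "fp16",
}
```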