Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Transferring Causal Effects using Proxies

Authors: Manuel Iglesias-Alonso, Felix Schur, Julius von Kügelgen, Jonas Peters

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	The theoretical results are supported by simulation studies and a real-world example studying the causal effect of website rankings on consumer choices.
Researcher Affiliation	Academia	Manuel Iglesias-Alonso ETH Zürich Felix Schur Seminar for Statistics ETH Zürich Julius von Kügelgen Seminar for Statistics ETH Zürich Jonas Peters Seminar for Statistics ETH Zürich
Pseudocode	No	The paper describes algorithms in paragraph form (e.g., 'The algorithm has the following steps:' in C.1 Data generation) but does not provide a clearly structured pseudocode or algorithm block with specific labels.
Open Source Code	Yes	Code implementing both estimators, along with the experiments, is available on https://github.com/manueligal/proxy-intervention.
Open Datasets	Yes	We consider the Expedia Hotel Searches dataset [50], in which each observation corresponds to a query made by a user of Expedia's webpage. It contains information about the user, the search filters, the hotels displayed, and whether the user clicked on or booked any of the shown accommodations. ... Adam Woznica, Ben Hamner, Dan Friedman, and SSA_Expedia. Personalize Expedia Hotel Searches ICDM 2013. https://kaggle.com/competitions/expedia-personalized-sort, 2013. Kaggle.
Dataset Splits	No	As source domains, we use those hotels that have at least 2000 observations (except for the ones chosen as target domains, see below); this results in 25 hotels. ... As target domains, we choose 18 hotels among those hotels that appear in at least 1500 different queries in the randomized dataset. This ensures that we can obtain reasonable estimates using the oracle, which we consider the ground truth in this experiment. In total, we obtain ca. 64 000 and 50 000 observations from 8400 and 4000 queries for the source and target domains, respectively. While this describes the division of data into source and target domains and observation counts, it does not specify traditional training, testing, or validation splits with percentages or specific sample counts for model evaluation.
Hardware Specification	No	The paper mentions a 'runtime analysis' in Appx. C.5 and computation time but does not specify any particular hardware like CPU or GPU models used for the experiments.
Software Dependencies	No	The optimization is done using the default optim function in R with the L-BFGS-B algorithm [55]. While 'R' and 'L-BFGS-B algorithm' are mentioned, specific version numbers for these software components are not provided.
Experiment Setup	Yes	The optimization is done using the default optim function in R with the L-BFGS-B algorithm [55]. The initial value for each component of θ is generated from a continuous uniform distribution in [0, 1] and we consider a maximum of 50 000 iterations. ... The hyperparameters chosen in the simulation experiments are included together with the results in §5 and the appendices. ... The plot presented in Fig. 3 uses a sample size of n = 20 000. We repeat the same procedure but using the sample size n = 1000 in Fig. 7 and n = 100 000 in Fig. 8. ... In Fig. 4, we use the parameters k_E = 3, n = 20 000, M = 10, and N = 5. ... All the confidence intervals are at level 0.95.
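
The source/target domain selection quoted in the Dataset Splits row can be sketched as below. This is a minimal illustration, not the authors' code (their implementation is in R, in the linked repository). The function names, the (query_id, hotel_id) row layout, and the toy data are all assumptions made for this example. The thresholds of 2000 observations and 1500 queries come from the quoted text. How the 18 target hotels were picked among the candidates is not stated in the row, so the target list is passed in by the caller.

```python
from collections import Counter, defaultdict


def target_candidates(randomized_rows, min_queries=1500):
    """Hotels appearing in at least `min_queries` distinct queries.

    `randomized_rows` is an iterable of (query_id, hotel_id) pairs; this
    layout is an assumption for the sketch, not the dataset's real schema.
    """
    queries_per_hotel = defaultdict(set)
    for query_id, hotel_id in randomized_rows:
        queries_per_hotel[hotel_id].add(query_id)
    return sorted(
        hotel for hotel, queries in queries_per_hotel.items()
        if len(queries) >= min_queries
    )


def select_source_domains(hotel_ids, target_hotels, min_observations=2000):
    """Hotels with at least `min_observations` rows, excluding the targets.

    `hotel_ids` holds one hotel identifier per observation.
    """
    counts = Counter(hotel_ids)
    targets = set(target_hotels)
    return sorted(
        hotel for hotel, n in counts.items()
        if n >= min_observations and hotel not in targets
    )


# Toy data with small thresholds so the logic is easy to follow by hand.
toy_rows = [(1, "A"), (1, "B"), (2, "A"), (2, "C"), (3, "A"), (3, "B")]
toy_candidates = target_candidates(toy_rows, min_queries=2)

toy_observations = ["A", "A", "A", "B", "B", "C"]
toy_sources = select_source_domains(
    toy_observations, target_hotels=["A"], min_observations=2
)
```

With the toy data, hotels A and B each appear in at least two distinct queries, so both are target candidates. Choosing A as the target leaves B as the only source domain, because C has too few observations.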