Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Model Selection for Off-policy Evaluation: New Algorithms and Experimental Protocol

Authors: Pai Liu, Lingfeng Zhao, Shivangi Agarwal, Jinghan Liu, Audrey Huang, Philip Amortila, Nan Jiang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	4. Preliminary experiments (Section 6): We instantiate the protocol in Gym Hopper [9] and demonstrate the various ways in which we can evaluate and understand different selectors.
Researcher Affiliation	Academia	Pai Liu UIUC Lingfeng Zhao Columbia University Shivangi Agarwal IIIT Delhi Jinghan Liu USTC Audrey Huang UIUC Philip Amortila UIUC Nan Jiang UIUC
Pseudocode	No	The paper describes algorithms such as LSTD-Tournament in Section 3.2 and model-based selectors in Section 4 and Appendix B, but these are presented in descriptive text with mathematical formulations rather than structured pseudocode or algorithm blocks.
Open Source Code	Yes	Our code is available at https://github.com/Coder-PAI/2025_neurips_model_selection_rl.git.
Open Datasets	Yes	We instantiate the protocol in Gym Hopper [9] and demonstrate the various ways in which we can evaluate and understand different selectors. Our experiments will be based on the Hopper-v4 environment [9].
Dataset Splits	Yes	A dataset is collected by sampling trajectories until n = 3200 transition tuples are obtained. To account for the randomness of D, we use bootstrapping to sample (with replacement) multiple datasets, and report an algorithm's mean performance across these bootstrapped samples with 95% confidence intervals. In Figure 5R, we take a previous experiment setting (Mg) and isolate a particular target policy π; then, we create two datasets: (1) Dπ sampled using π; (2) Doff sampled using a policy that is created to be very different from the target policies and offer very little coverage (see Appendix D.2). Then, we run the algorithm with λ fraction of data from Dπ combined with (1 − λ) from Doff;
Hardware Specification	Yes	The main experiment on MF.G, took nearly a week on a 4090 PC.
Software Dependencies	No	The paper mentions "DDPG [37]" and "Mujoco engine" [13] but does not provide specific version numbers for these or other software dependencies crucial for replication.
Experiment Setup	Yes	Our experiments will be based on the Hopper-v4 environment [9]. To create a variety of environments, we add different levels of stochastic noise in the transitions and change the gravity constant (see Appendix D.1). For horizon, we set H = 1024 which is substantially longer than typically observed trajectories from the target policies. For l, we plot the convergence of J_M(π) estimation and choose l = 128 accordingly (see Figure 2R). A dataset is collected by sampling trajectories until n = 3200 transition tuples are obtained.
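
The Dataset Splits row quotes two procedures: bootstrapping a dataset of n = 3200 transitions to report a mean with 95% confidence intervals, and mixing a λ fraction of Dπ with a (1 − λ) fraction of Doff. The Python sketch below illustrates both under stated assumptions and is not the authors' code. The function names `bootstrap_mean_ci` and `mix_datasets`, the percentile-style interval, the number of resamples, and the choice to draw the mixture with replacement are all hypothetical; the quoted text does not specify them.

```python
import random
import statistics


def bootstrap_mean_ci(dataset, statistic, n_boot=200, alpha=0.05, seed=0):
    """Resample `dataset` with replacement and summarize `statistic`.

    Returns (mean over resamples, lower bound, upper bound). The bounds are a
    simple nearest-rank percentile interval, which is an assumption here.
    """
    rng = random.Random(seed)
    n = len(dataset)
    values = []
    for _ in range(n_boot):
        resample = rng.choices(dataset, k=n)  # sampling with replacement
        values.append(statistic(resample))
    values.sort()
    lo = values[int((alpha / 2) * (n_boot - 1))]
    hi = values[int(round((1 - alpha / 2) * (n_boot - 1)))]
    return statistics.mean(values), lo, hi


def mix_datasets(d_pi, d_off, lam, n, seed=0):
    """Build n samples: a `lam` fraction from d_pi, the rest from d_off.

    Drawing with replacement is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    n_pi = round(lam * n)
    return rng.choices(d_pi, k=n_pi) + rng.choices(d_off, k=n - n_pi)
```

In use, `statistic` would be whatever scalar is being reported for a selector on one resampled dataset (for example, its OPE error), and `mix_datasets(d_pi, d_off, lam, n=3200)` would be called once per value of λ on the sweep.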