Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

STITCH-OPE: Trajectory Stitching with Guided Diffusion for Off-Policy Evaluation

Authors: Hossein Goli, Michael Gimelfarb, Nathan de Lara, Haruki Nishimura, Masha Itkina, Florian Shkurti

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Experiments on the D4RL and Open AI Gym benchmarks show substantial improvement in mean squared error, correlation, and regret metrics compared to state-of-the-art OPE methods. Our empirical evaluation aims to answer the following research questions: 1. Does the combination of conditional diffusion and negative guidance (as hypothesized in Table 1) translate to robust OPE performance on standard benchmarks? 2. Is STITCH-OPE robust across problem size (e.g., state/action dimension, horizon)? 3. Is STITCH-OPE robust across different levels of optimality of the target policy and the classes of policies?
Researcher Affiliation	Collaboration	1Department of Computer Science, University of Toronto 2University of Toronto Robotics Institute, Toronto, Canada 3Toyota Research Institute, Los Altos, California 4Vector Institute, Toronto, Canada
Pseudocode	Yes	A high-level pseudocode of conditional diffusion model training in STITCH-OPE is provided as Algorithm 1. A pseudocode of the off-policy evaluation subroutine for a single rollout is provided as Algorithm 2.
Open Source Code	Yes	Project website and code: stitch-ope.github.io. Anonymized code is included in the zip file as part of the supplementary material, along with instructions to run the code in a readme file.
Open Datasets	Yes	We evaluate the performance of STITCH-OPE in high-dimensional long-horizon tasks using the standard D4RL benchmark [13] and their respective benchmark policies [14]. Specifically, we use the halfcheetah-medium, hopper-medium and walker2d-medium behavior datasets. We also carry out similar experiments using classical control tasks (Pendulum and Acrobot) from Open AI Gym [4].
Dataset Splits	No	The paper mentions using 'medium datasets from the D4RL offline suite [13]' and 'classical control tasks (Pendulum and Acrobot) from Open AI Gym [4]'. It also mentions 'Each evaluation consists of 10 target policies...trained at varying levels of ability'. However, it does not explicitly describe how these datasets were split into training, validation, or test sets for the diffusion model training itself; rather, it implies training on the provided behavior data and evaluating different policies.
Hardware Specification	Yes	Hardware and Software. All experiments were conducted on a local workstation running Ubuntu 20.04 LTS and Python 3.9, with the following hardware: 2 NVIDIA RTX 3090 GPUs (24 GB each); Intel(R) Core(TM) i9-9820X CPU @ 3.30GHz (10 cores / 20 threads); 128 GB RAM.
Software Dependencies	No	The paper states 'Ubuntu 20.04 LTS and Python 3.9' but does not list specific version numbers for key software libraries or frameworks (e.g., PyTorch, TensorFlow, NumPy) that would be critical for reproducibility of deep learning experiments.
Experiment Setup	Yes	STITCH-OPE Training and Hyper-Parameter Details. The list of training hyper-parameters for the trajectory diffusion model is provided in Table 11 (description: value). Diffusion architecture: UNet; denoising time steps: 256; learning rate of Adam optimizer: 0.0003; training epochs (passes over the data set): 150; batch size: 128; training steps per epoch: 5000 (D4RL), 2000 (Gym); guidance coefficient for π, i.e. α: 0.5 (D4RL), 0.1 (Gym); guidance coefficient ratio for β, i.e. λ / α: 0.5 (D4RL), 1 (Gym); window size of sub-trajectories, i.e. w: 8 (D4RL), 16 (Gym).
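
The hyper-parameters quoted in the Experiment Setup row can be collected into a small configuration object, which makes the D4RL and Gym settings easier to compare. The following is a minimal sketch, not the authors' code: the class name `DiffusionTrainingConfig`, its field names, and the two derived properties are hypothetical and introduced here for illustration only. The numeric values are the ones quoted from Table 11. The derived quantities assume that λ is recovered as (λ / α) × α and that each training step is a single optimizer update.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DiffusionTrainingConfig:
    """Hypothetical container for the values quoted from Table 11."""

    # Settings that differ between D4RL and Gym (no defaults).
    steps_per_epoch: int
    guidance_alpha: float   # guidance coefficient for the target policy pi
    guidance_ratio: float   # lambda / alpha, for the behavior policy beta
    window_size: int        # sub-trajectory window size w

    # Settings shared by both benchmark suites.
    architecture: str = "UNet"
    denoising_steps: int = 256
    learning_rate: float = 3e-4   # Adam optimizer
    epochs: int = 150
    batch_size: int = 128

    @property
    def guidance_lambda(self) -> float:
        # Assumption: lambda = (lambda / alpha) * alpha.
        return self.guidance_ratio * self.guidance_alpha

    @property
    def total_training_steps(self) -> int:
        # Assumption: one training step equals one optimizer update.
        return self.epochs * self.steps_per_epoch


D4RL = DiffusionTrainingConfig(
    steps_per_epoch=5000, guidance_alpha=0.5, guidance_ratio=0.5, window_size=8
)
GYM = DiffusionTrainingConfig(
    steps_per_epoch=2000, guidance_alpha=0.1, guidance_ratio=1.0, window_size=16
)

if __name__ == "__main__":
    for name, cfg in (("D4RL", D4RL), ("Gym", GYM)):
        print(name, cfg.guidance_lambda, cfg.total_training_steps)
```

Under these assumptions the implied behavior-guidance coefficient λ is 0.25 for D4RL and 0.1 for Gym. The library versions needed to rerun training are not recoverable from this record, as the Software Dependencies row notes.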