Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Approximate Policy Iteration with Bisimulation Metrics

Authors: Mete Kemertas, Allan Douglas Jepson

TMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Lastly, we validate our theoretical results and investigate their practical implications via a controlled empirical analysis based on an implementation of bisimulation-based API for finite MDPs."
Researcher Affiliation | Collaboration | "Mete Kemertas EMAIL Department of Computer Science, University of Toronto; Vector Institute. Allan Jepson EMAIL Department of Computer Science, University of Toronto; Samsung AI Center, Toronto"
Pseudocode | Yes | "Algorithm 1: ϵ-aggregation for finite S"
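The paper gives ϵ-aggregation only as pseudocode (Algorithm 1). As a rough illustration of the idea — grouping states whose pairwise metric distance is at most ϵ — here is a hypothetical greedy sketch; the function name, the greedy lowest-index-first strategy, and the precomputed distance matrix `d` are our assumptions, not the paper's exact algorithm:

```python
def epsilon_aggregate(d, eps):
    """Greedily group states whose pairwise distance is at most eps.

    d   -- |S| x |S| symmetric distance matrix (list of lists)
    eps -- aggregation threshold
    Returns a list of clusters (lists of state indices).
    NOTE: illustrative sketch only, not the paper's Algorithm 1 verbatim.
    """
    unassigned = set(range(len(d)))
    clusters = []
    while unassigned:
        s = min(unassigned)  # pick the lowest-index unassigned state as a pivot
        cluster = [t for t in sorted(unassigned) if d[s][t] <= eps]
        unassigned.difference_update(cluster)
        clusters.append(cluster)
    return clusters
```

With ϵ = 0 each state forms its own cluster; larger ϵ trades approximation error for a smaller aggregated MDP.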
Open Source Code | No | "Our source code will be open-sourced for reproducibility."
Open Datasets | No | "In this section, we conduct experiments to empirically investigate the practical implications of the theoretical results in Sec. 3. To this end, we consider a discounted problem involving |S| states and m equivalence classes (ECs) denoted Bi, where i ∈ {1, . . . , m}, and each EC contains the same number of states |S|/m. At each step, an agent decides between two actions a0 and a1: taking a0 in Bi transitions the agent to Bi+1, while a1 transitions the agent to B1 (both with probability 1). Only when the agent takes a0 in Bm does it obtain a reward of 1, and taking a0 then keeps the agent in Bm. Hence, the agent is required to take a0 at least m times consecutively to collect rewards after taking a1 once. When constructing P, we sample uniformly from the (|S|/m − 1)-simplex to determine the transition probabilities to states in each EC."
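The environment quoted above is simple enough to reconstruct from the description. The sketch below is our hypothetical reading, not the authors' code: the state ordering (block Bi occupies a contiguous index range), the use of a symmetric Dirichlet draw for the uniform-simplex sample, and the name `build_mdp` are all assumptions.

```python
import numpy as np

def build_mdp(num_states=200, m=20, seed=0):
    """Cyclic block MDP sketched from the paper's description (not the
    authors' implementation). Block B_i holds states [i*k, (i+1)*k),
    where k = num_states // m.

    a0 advances B_i -> B_{i+1}; in B_m it stays in B_m and pays reward 1.
    a1 resets the agent to B_1. Within the target block, transition
    probabilities are drawn uniformly from the (k-1)-simplex (Dirichlet(1)).
    """
    rng = np.random.default_rng(seed)
    k = num_states // m
    P = np.zeros((2, num_states, num_states))  # P[a, s, s']
    R = np.zeros((2, num_states))              # R[a, s]
    for i in range(m):
        tgt0 = min(i + 1, m - 1)               # a0 target block (B_m absorbs)
        for s in range(i * k, (i + 1) * k):
            P[0, s, tgt0 * k:(tgt0 + 1) * k] = rng.dirichlet(np.ones(k))
            P[1, s, 0:k] = rng.dirichlet(np.ones(k))
        if i == m - 1:
            R[0, i * k:(i + 1) * k] = 1.0      # reward only for a0 in B_m
    return P, R
```

By construction each row of P sums to 1 and states within a block are bisimilar: they share rewards and block-level transition behavior, differing only in the within-block probabilities.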
Dataset Splits | No | "In the experiments outlined here, we choose |S| = 200, m = 20 and γ = 0.9. Unless otherwise stated, we use p = λ = 1 and n = ⌈log((1 − γ)/(1 + γ))/log(γ)⌉ as in Thm. 3.8 (n = 28 for γ = 0.9). ... All figures present results over 10 seeds with shaded areas showing standard error."
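The n-formula in the excerpt is garbled by extraction; assuming it reads n = ⌈log((1 − γ)/(1 + γ))/log γ⌉, the quoted value n = 28 for γ = 0.9 checks out:

```python
import math

gamma = 0.9
# Our reading of the garbled excerpt: n = ceil(log((1-gamma)/(1+gamma)) / log(gamma)).
# log(0.1/1.9) / log(0.9) ~= 27.94, so the ceiling gives n = 28, matching the text.
n = math.ceil(math.log((1 - gamma) / (1 + gamma)) / math.log(gamma))
print(n)  # 28
```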
Hardware Specification | Yes | "Wall-clock time per step on an NVIDIA GeForce GTX 1080 GPU."
Software Dependencies | No | "Our Sinkhorn-Knopp implementation is based on the Python Optimal Transport package (Flamary et al., 2021)."
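The paper relies on the POT package's Sinkhorn solver; for readers unfamiliar with the underlying iteration, here is a minimal self-contained numpy sketch of Sinkhorn-Knopp (illustrative only — the function name, the fixed iteration count, and the lack of a convergence check are our simplifications, not POT's implementation):

```python
import numpy as np

def sinkhorn(a, b, M, reg=0.1, n_iter=200):
    """Minimal Sinkhorn-Knopp iteration for entropy-regularized OT.

    a, b -- source/target marginals (1-D arrays summing to 1)
    M    -- cost matrix, shape (len(a), len(b))
    reg  -- entropic regularization strength
    Returns the regularized transport plan diag(u) @ K @ diag(v).
    """
    K = np.exp(-M / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):       # alternate marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]
```

The returned plan's row and column sums converge to a and b; smaller `reg` approaches the unregularized OT cost but needs more iterations (and, in practice, log-domain stabilization, which POT provides).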
Experiment Setup | Yes | "In the experiments outlined here, we choose |S| = 200, m = 20 and γ = 0.9. Unless otherwise stated, we use p = λ = 1 and n = ⌈log((1 − γ)/(1 + γ))/log(γ)⌉ as in Thm. 3.8 (n = 28 for γ = 0.9). However, metric learning can be terminated early at a given step k if ‖F^(i+1)_πk(d) − F^(i)_πk(d)‖ ≤ 10^−3. To simulate noisy greedy updates, we perturb action probabilities of the ground-truth greedy policy with Gaussian noise and renormalize to form a distribution. A heuristic search of the noise variance ensures ‖T_πg V − TV‖ ∈ [0.05, 0.1] so that δ ≤ 0.1. In all cases, we initialize π0 to be the maximum-entropy policy."
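The noisy-greedy perturbation described above is straightforward to sketch; this is a hypothetical rendering of the procedure (the function name and the clipping floor are our choices, and `sigma` stands in for the heuristically searched noise scale):

```python
import numpy as np

def perturb_greedy(pi_greedy, sigma, rng):
    """Perturb a greedy policy's action probabilities with Gaussian noise
    and renormalize to a distribution (a sketch of the setup above; sigma
    would be tuned so the induced Bellman error lands in [0.05, 0.1]).

    pi_greedy -- array of shape (|S|, |A|), rows are action distributions
    """
    noisy = pi_greedy + rng.normal(0.0, sigma, size=pi_greedy.shape)
    noisy = np.clip(noisy, 1e-8, None)  # keep probabilities positive
    return noisy / noisy.sum(axis=-1, keepdims=True)
```

Each row of the output is again a valid distribution, concentrated near the greedy action for small `sigma` and increasingly uniform as `sigma` grows.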