Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et alK. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026..
Approximate Policy Iteration with Bisimulation Metrics
Authors: Mete Kemertas, Allan Douglas Jepson
TMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Lastly, we validate our theoretical results and investigate their practical implications via a controlled empirical analysis based on an implementation of bisimulation-based API for finite MDPs. |
| Researcher Affiliation | Collaboration | Mete Kemertas EMAIL Department of Computer Science, University of Toronto Vector Institute Allan Jepson EMAIL Department of Computer Science, University of Toronto Samsung AI Center, Toronto |
| Pseudocode | Yes | Algorithm 1 ϵ-aggregation for finite S |
| Open Source Code | No | Our source code will be open-sourced for reproducibility. |
| Open Datasets | No | In this section, we conduct experiments to empirically investigate the practical implications of the theoretical results in Sec. 3. To this end, we consider a discounted problem involving |S| states and m equivalence classes (ECs) denoted Bi where i [1, . . . , m] and each EC contains the same number of states |S|/m. At each step, an agent decides between two actions a0 and a1: taking a0 in Bi transitions the agent to Bi+1, while a1 transitions the agent to B1 (both with probability 1). Only when the agent takes a0 in Bm does it obtain a reward of 1 and taking a0 keeps the agent in Bm then. Hence, the agent is required to take a0 at least m times consecutively to collect rewards after taking a1 once. When constructing P, we sample uniformly from the (|S|/m 1)-simplex to determine the transition probabilities to states in each EC. |
| Dataset Splits | No | In the experiments outlined here, we choose |S| = 200, m = 20 and γ = 0.9. Unless otherwise stated, we use use p = λ = 1 and n = log( 1 γ 1+γ )/ log(γ) as in Thm. 3.8 (n = 28 for γ = 0.9). ... All figures present results over 10 seeds with shaded areas showing standard error. |
| Hardware Specification | Yes | Wall-clock time per step on an NVIDIA GeForce GTX 1080 GPU. |
| Software Dependencies | No | Our Sinkhorn-Knopp implementation on the Python Optimal Transport package (Flamary et al., 2021). |
| Experiment Setup | Yes | In the experiments outlined here, we choose |S| = 200, m = 20 and γ = 0.9. Unless otherwise stated, we use use p = λ = 1 and n = log( 1 γ 1+γ )/ log(γ) as in Thm. 3.8 (n = 28 for γ = 0.9). However, metric learning can be terminated early at a given step k if F(i+1) πk (d) F(i) πk (d) 10 3. To simulate noisy greedy updates, we perturb action probabilities of the ground-truth greedy policy with Gaussian noise and renormalize to form a distribution. A heuristic search of the noise variance ensures Tπg V TV [0.05, 0.1] so that δ 0.1. In all cases, we initialize π0 to be the maximum-entropy policy. |