Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Approximate Policy Iteration with Bisimulation Metrics
Authors: Mete Kemertas, Allan Douglas Jepson
TMLR 2022 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Lastly, we validate our theoretical results and investigate their practical implications via a controlled empirical analysis based on an implementation of bisimulation-based API for finite MDPs. |
| Researcher Affiliation | Collaboration | Mete Kemertas EMAIL Department of Computer Science, University of Toronto Vector Institute Allan Jepson EMAIL Department of Computer Science, University of Toronto Samsung AI Center, Toronto |
| Pseudocode | Yes | Algorithm 1 ϵ-aggregation for finite S |
| Open Source Code | No | Our source code will be open-sourced for reproducibility. |
| Open Datasets | No | In this section, we conduct experiments to empirically investigate the practical implications of the theoretical results in Sec. 3. To this end, we consider a discounted problem involving \|S\| states and m equivalence classes (ECs) denoted Bi, where i ∈ [1, . . . , m], and each EC contains the same number of states \|S\|/m. At each step, an agent decides between two actions a0 and a1: taking a0 in Bi transitions the agent to Bi+1, while a1 transitions the agent to B1 (both with probability 1). The agent obtains a reward of 1 only when it takes a0 in Bm, and taking a0 then keeps it in Bm. Hence, the agent is required to take a0 at least m times consecutively to collect rewards after taking a1 once. When constructing P, we sample uniformly from the (\|S\|/m − 1)-simplex to determine the transition probabilities to states in each EC. |
| Dataset Splits | No | In the experiments outlined here, we choose \|S\| = 200, m = 20 and γ = 0.9. Unless otherwise stated, we use p = λ = 1 and n = ⌈log((1 − γ)/(1 + γ))/log(γ)⌉ as in Thm. 3.8 (n = 28 for γ = 0.9). ... All figures present results over 10 seeds with shaded areas showing standard error. |
| Hardware Specification | Yes | Wall-clock time per step on an NVIDIA GeForce GTX 1080 GPU. |
| Software Dependencies | No | Our Sinkhorn-Knopp implementation is based on the Python Optimal Transport package (Flamary et al., 2021). |
| Experiment Setup | Yes | In the experiments outlined here, we choose \|S\| = 200, m = 20 and γ = 0.9. Unless otherwise stated, we use p = λ = 1 and n = ⌈log((1 − γ)/(1 + γ))/log(γ)⌉ as in Thm. 3.8 (n = 28 for γ = 0.9). However, metric learning can be terminated early at a given step k if ∥F^(i+1)_πk(d) − F^(i)_πk(d)∥ ≤ 10^−3. To simulate noisy greedy updates, we perturb action probabilities of the ground-truth greedy policy with Gaussian noise and renormalize to form a distribution. A heuristic search of the noise variance ensures ∥T_πg V − T V∥ ∈ [0.05, 0.1] so that δ ≤ 0.1. In all cases, we initialize π0 to be the maximum-entropy policy. |
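The chain-of-equivalence-classes MDP quoted in the Open Datasets row can be sketched in a few lines. This is a minimal illustration assuming NumPy; the function name `build_cyclic_mdp` and its signature are hypothetical, not from the paper:

```python
import numpy as np

def build_cyclic_mdp(num_states=200, m=20, seed=0):
    """Sketch of the paper's controlled MDP: |S| states split into m
    equal-size equivalence classes B_1..B_m. Action a0 in B_i moves the
    agent to B_{i+1} (B_m is absorbing under a0, with reward 1); action
    a1 resets the agent to B_1."""
    rng = np.random.default_rng(seed)
    k = num_states // m                      # states per equivalence class
    P = np.zeros((2, num_states, num_states))  # P[a, s, s']
    R = np.zeros((2, num_states))              # R[a, s]
    for i in range(m):
        src = slice(i * k, (i + 1) * k)
        nxt = min(i + 1, m - 1)              # a0: advance; B_m stays in B_m
        # within the target EC, each row is drawn uniformly from the
        # (k-1)-simplex, i.e. Dirichlet(1, ..., 1)
        P[0, src, nxt * k:(nxt + 1) * k] = rng.dirichlet(np.ones(k), size=k)
        P[1, src, 0:k] = rng.dirichlet(np.ones(k), size=k)  # a1: reset to B_1
        if i == m - 1:
            R[0, src] = 1.0                  # reward only for a0 taken in B_m
    return P, R
```

Drawing each transition row from a symmetric Dirichlet with all concentration parameters equal to 1 samples it uniformly from the (\|S\|/m − 1)-simplex, matching the construction the quoted excerpt describes.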
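The Software Dependencies row notes that the metric computation runs Sinkhorn-Knopp via the Python Optimal Transport package. The sketch below is a self-contained, hypothetical re-implementation of the basic algorithm (not the paper's or POT's code), showing entropic-regularized optimal transport between two distributions p and q under a cost matrix M:

```python
import numpy as np

def sinkhorn(p, q, M, reg=1e-2, n_iters=500):
    """Minimal Sinkhorn-Knopp sketch: alternate diagonal scalings of the
    Gibbs kernel until the transport plan's marginals match p and q, then
    return the (entropy-regularized) transport cost."""
    K = np.exp(-M / reg)                 # Gibbs kernel
    u = np.ones_like(p)
    for _ in range(n_iters):
        v = q / (K.T @ u)                # match column marginal q
        u = p / (K @ v)                  # match row marginal p
    T = u[:, None] * K * v[None, :]      # transport plan
    return float(np.sum(T * M))          # approximate transport cost
```

As reg → 0 the returned cost approaches the exact optimal-transport (Wasserstein) cost; in the bisimulation setting, M would be the current metric estimate between states and p, q the next-state distributions being compared.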