Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Approximate Policy Iteration with Bisimulation Metrics

Authors: Mete Kemertas, Allan Douglas Jepson

TMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Lastly, we validate our theoretical results and investigate their practical implications via a controlled empirical analysis based on an implementation of bisimulation-based API for finite MDPs."
Researcher Affiliation | Collaboration | "Mete Kemertas EMAIL Department of Computer Science, University of Toronto; Vector Institute. Allan Jepson EMAIL Department of Computer Science, University of Toronto; Samsung AI Center, Toronto"
Pseudocode | Yes | "Algorithm 1: ϵ-aggregation for finite S"
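The paper gives ϵ-aggregation only as pseudocode (Algorithm 1). As a rough illustration of the idea — grouping states whose pairwise metric distance is at most ϵ — here is a hypothetical greedy sketch; the function name, the greedy lowest-index-first strategy, and the precomputed distance matrix `d` are our assumptions, not the paper's exact algorithm:

```python
def epsilon_aggregate(d, eps):
    """Greedily group states whose pairwise distance is at most eps.

    d   -- |S| x |S| symmetric distance matrix (list of lists)
    eps -- aggregation threshold
    Returns a list of clusters (lists of state indices).
    NOTE: illustrative sketch only, not the paper's Algorithm 1 verbatim.
    """
    unassigned = set(range(len(d)))
    clusters = []
    while unassigned:
        s = min(unassigned)  # pick the lowest-index unassigned state as a pivot
        cluster = [t for t in sorted(unassigned) if d[s][t] <= eps]
        unassigned.difference_update(cluster)
        clusters.append(cluster)
    return clusters
```

With ϵ = 0 each state forms its own cluster; larger ϵ trades approximation error for a smaller aggregated MDP.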
Open Source Code | No | "Our source code will be open-sourced for reproducibility."
Open Datasets | No | "In this section, we conduct experiments to empirically investigate the practical implications of the theoretical results in Sec. 3. To this end, we consider a discounted problem involving |S| states and m equivalence classes (ECs) denoted Bi, where i ∈ {1, . . . , m}, and each EC contains the same number of states |S|/m. At each step, an agent decides between two actions a0 and a1: taking a0 in Bi transitions the agent to Bi+1, while a1 transitions the agent to B1 (both with probability 1). Only when the agent takes a0 in Bm does it obtain a reward of 1, and taking a0 then keeps the agent in Bm. Hence, the agent is required to take a0 at least m times consecutively to collect rewards after taking a1 once. When constructing P, we sample uniformly from the (|S|/m − 1)-simplex to determine the transition probabilities to states in each EC."
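The environment quoted above is simple enough to reconstruct from the description. The sketch below is our hypothetical reading, not the authors' code: the state ordering (block Bi occupies a contiguous index range), the use of a symmetric Dirichlet draw for the uniform-simplex sample, and the name `build_mdp` are all assumptions.

```python
import numpy as np

def build_mdp(num_states=200, m=20, seed=0):
    """Cyclic block MDP sketched from the paper's description (not the
    authors' implementation). Block B_i holds states [i*k, (i+1)*k),
    where k = num_states // m.

    a0 advances B_i -> B_{i+1}; in B_m it stays in B_m and pays reward 1.
    a1 resets the agent to B_1. Within the target block, transition
    probabilities are drawn uniformly from the (k-1)-simplex (Dirichlet(1)).
    """
    rng = np.random.default_rng(seed)
    k = num_states // m
    P = np.zeros((2, num_states, num_states))  # P[a, s, s']
    R = np.zeros((2, num_states))              # R[a, s]
    for i in range(m):
        tgt0 = min(i + 1, m - 1)               # a0 target block (B_m absorbs)
        for s in range(i * k, (i + 1) * k):
            P[0, s, tgt0 * k:(tgt0 + 1) * k] = rng.dirichlet(np.ones(k))
            P[1, s, 0:k] = rng.dirichlet(np.ones(k))
        if i == m - 1:
            R[0, i * k:(i + 1) * k] = 1.0      # reward only for a0 in B_m
    return P, R
```

By construction each row of P sums to 1 and states within a block are bisimilar: they share rewards and block-level transition behavior, differing only in the within-block probabilities.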
Dataset Splits | No | "In the experiments outlined here, we choose |S| = 200, m = 20 and γ = 0.9. Unless otherwise stated, we use p = λ = 1 and n = ⌈log((1 − γ)/(1 + γ))/log(γ)⌉ as in Thm. 3.8 (n = 28 for γ = 0.9). ... All figures present results over 10 seeds with shaded areas showing standard error."
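The n-formula in the excerpt is garbled by extraction; assuming it reads n = ⌈log((1 − γ)/(1 + γ))/log γ⌉, the quoted value n = 28 for γ = 0.9 checks out:

```python
import math

gamma = 0.9
# Our reading of the garbled excerpt: n = ceil(log((1-gamma)/(1+gamma)) / log(gamma)).
# log(0.1/1.9) / log(0.9) ~= 27.94, so the ceiling gives n = 28, matching the text.
n = math.ceil(math.log((1 - gamma) / (1 + gamma)) / math.log(gamma))
print(n)  # 28
```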
Hardware Specification | Yes | "Wall-clock time per step on an NVIDIA GeForce GTX 1080 GPU."
Software Dependencies | No | "Our Sinkhorn-Knopp implementation is based on the Python Optimal Transport package (Flamary et al., 2021)."
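The paper relies on the POT package's Sinkhorn solver; for readers unfamiliar with the underlying iteration, here is a minimal self-contained numpy sketch of Sinkhorn-Knopp (illustrative only — the function name, the fixed iteration count, and the lack of a convergence check are our simplifications, not POT's implementation):

```python
import numpy as np

def sinkhorn(a, b, M, reg=0.1, n_iter=200):
    """Minimal Sinkhorn-Knopp iteration for entropy-regularized OT.

    a, b -- source/target marginals (1-D arrays summing to 1)
    M    -- cost matrix, shape (len(a), len(b))
    reg  -- entropic regularization strength
    Returns the regularized transport plan diag(u) @ K @ diag(v).
    """
    K = np.exp(-M / reg)          # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):       # alternate marginal projections
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]
```

The returned plan's row and column sums converge to a and b; smaller `reg` approaches the unregularized OT cost but needs more iterations (and, in practice, log-domain stabilization, which POT provides).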
Experiment Setup | Yes | "In the experiments outlined here, we choose |S| = 200, m = 20 and γ = 0.9. Unless otherwise stated, we use p = λ = 1 and n = ⌈log((1 − γ)/(1 + γ))/log(γ)⌉ as in Thm. 3.8 (n = 28 for γ = 0.9). However, metric learning can be terminated early at a given step k if ‖F^(i+1)_πk(d) − F^(i)_πk(d)‖ ≤ 10^−3. To simulate noisy greedy updates, we perturb action probabilities of the ground-truth greedy policy with Gaussian noise and renormalize to form a distribution. A heuristic search of the noise variance ensures ‖T_πg V − TV‖ ∈ [0.05, 0.1] so that δ ≤ 0.1. In all cases, we initialize π0 to be the maximum-entropy policy."
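The noisy-greedy perturbation described above is straightforward to sketch; this is a hypothetical rendering of the procedure (the function name and the clipping floor are our choices, and `sigma` stands in for the heuristically searched noise scale):

```python
import numpy as np

def perturb_greedy(pi_greedy, sigma, rng):
    """Perturb a greedy policy's action probabilities with Gaussian noise
    and renormalize to a distribution (a sketch of the setup above; sigma
    would be tuned so the induced Bellman error lands in [0.05, 0.1]).

    pi_greedy -- array of shape (|S|, |A|), rows are action distributions
    """
    noisy = pi_greedy + rng.normal(0.0, sigma, size=pi_greedy.shape)
    noisy = np.clip(noisy, 1e-8, None)  # keep probabilities positive
    return noisy / noisy.sum(axis=-1, keepdims=True)
```

Each row of the output is again a valid distribution, concentrated near the greedy action for small `sigma` and increasingly uniform as `sigma` grows.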