Scalable Methods for Computing State Similarity in Deterministic Markov Decision Processes

Authors: Pablo Samuel Castro (pp. 10069-10076)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment: each entry below lists the variable, the assessed result, and the LLM response supporting it.
Research Type: Experimental. "In this section we provide empirical evidence for the effectiveness of our bisimulation approximants. We begin with a simple 31-state Grid World, on which we can compute the bisimulation metric exactly, and use a noisy representation which yields a continuous-state MDP. Having the exact metric for the 31-state MDP allows us to quantitatively measure the quality of our learned approximant. We then learn a π-bisimulation approximant over policies generated by reinforcement learning agents trained on Atari 2600 games."
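For a deterministic MDP, the bisimulation metric that the paper computes exactly on the Grid World satisfies d(s, t) = max_a (|R(s, a) - R(t, a)| + γ·d(s', t')), where s' and t' are the unique successor states. The sketch below is a minimal NumPy fixed-point iteration illustrating that computation; it is not the paper's released code, and the names exact_bisimulation_metric, R, and next_state are illustrative.

```python
import numpy as np

def exact_bisimulation_metric(R, next_state, gamma=0.99, tol=1e-8):
    """Fixed-point iteration for the bisimulation metric of a deterministic MDP.

    R:          float array [num_states, num_actions] of rewards.
    next_state: int array [num_states, num_actions] of successor states.

    In the deterministic case the Kantorovich term collapses to the distance
    between the unique successors, giving the update
        d(s, t) <- max_a |R[s, a] - R[t, a]| + gamma * d(s', t').
    """
    d = np.zeros((R.shape[0], R.shape[0]))
    while True:
        # |R[s, a] - R[t, a]| for every state pair and every action.
        reward_gap = np.abs(R[:, None, :] - R[None, :, :])
        # d(next(s, a), next(t, a)) for every state pair and every action.
        successor_gap = d[next_state[:, None, :], next_state[None, :, :]]
        d_new = np.max(reward_gap + gamma * successor_gap, axis=-1)
        if np.max(np.abs(d_new - d)) < tol:  # gamma-contraction, so this converges
            return d_new
        d = d_new
```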
Researcher Affiliation: Industry. "Pablo Samuel Castro, Google Brain, psc@google.com"
Pseudocode: No. No explicit pseudocode or algorithm blocks were found in the paper.
Open Source Code: Yes. Code is available at https://github.com/google-research/google-research/tree/master/bisimulation_aaai2020
Open Datasets: Yes. "We begin with a simple 31-state Grid World... We then learn a π-bisimulation approximant over policies generated by reinforcement learning agents trained on Atari 2600 games." The Atari games are run through the Arcade Learning Environment (Bellemare et al. 2013).
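A deterministic grid world is easy to reconstruct in the tabular form consumed by the fixed-point sketch above. The snippet below builds a hypothetical 3x3 grid with a single goal cell; it is not the paper's exact 31-state layout, which is not reproduced here, and all names are illustrative.

```python
import numpy as np

# Hypothetical 3x3 deterministic grid world in the tabular form used by
# the sketch above: rewards R[s, a] and deterministic successors
# next_state[s, a]. (Not the paper's 31-state layout.)
WIDTH, HEIGHT = 3, 3
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # left, right, down, up
GOAL = (2, 2)

num_states = WIDTH * HEIGHT
R = np.zeros((num_states, len(ACTIONS)))
next_state = np.zeros((num_states, len(ACTIONS)), dtype=int)

for s in range(num_states):
    x, y = s % WIDTH, s // WIDTH
    for a, (dx, dy) in enumerate(ACTIONS):
        nx = min(max(x + dx, 0), WIDTH - 1)  # moving into a wall is a no-op
        ny = min(max(y + dy, 0), HEIGHT - 1)
        next_state[s, a] = ny * WIDTH + nx
        R[s, a] = 1.0 if (nx, ny) == GOAL else 0.0

# d = exact_bisimulation_metric(R, next_state)  # see the sketch above
```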
Dataset Splits: No. The paper does not provide explicit train/validation/test splits (e.g., percentages or sample counts); it describes the training procedure and how data is sampled from a replay buffer, but no fixed partition of the data.
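Since training data comes from uniform sampling of a replay buffer rather than a fixed partition, a minimal sketch of that scheme follows. The ReplayBuffer class is illustrative, not the paper's implementation; the default batch size matches the b = 256 reported in the experiment setup.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal uniform-sampling replay buffer: training batches are drawn
    at random from stored transitions instead of a fixed train/test split."""

    def __init__(self, capacity=100_000):
        self.transitions = deque(maxlen=capacity)  # oldest entries evicted first

    def add(self, state, action, reward, next_state):
        self.transitions.append((state, action, reward, next_state))

    def sample(self, batch_size=256):  # b = 256 in the paper's setup
        return random.sample(self.transitions, batch_size)
```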
Hardware Specification: Yes. "Training was done on a Tesla P100 GPU."
Software Dependencies: No. The paper mentions the "Adam optimizer (Kingma and Ba 2015)" but does not give version numbers for any of the software libraries used in the implementation.
Experiment Setup: Yes. "We ran our experiments with γ = 0.99, C = 500, b = 256, and increased β from 0 to 1 by a factor of 0.9 every time the target network was updated; we used the Adam optimizer (Kingma and Ba 2015) with a learning rate of 0.01. ... We used the Adam optimizer (Kingma and Ba 2015) with a learning rate of 7.5e-5 (except for Pong, where we found 0.001 yielded better results)."
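The quoted hyperparameters translate directly into a schedule sketch. The paper does not spell out the exact β update rule, so the geometric gap-shrinking below, where the gap 1 - β is multiplied by 0.9 at each target-network update, is one plausible reading of "increased β from 0 to 1 by a factor of 0.9"; beta_schedule and the constant names are illustrative.

```python
# Hyperparameters quoted above (Grid World experiments).
GAMMA = 0.99           # discount factor
TARGET_UPDATE_C = 500  # target-network update period C
BATCH_SIZE = 256       # batch size b
LEARNING_RATE = 1e-2   # Adam; 7.5e-5 for Atari, 1e-3 for Pong

def beta_schedule(num_target_updates):
    """Anneal beta from 0 toward 1. Assumption: "increased by a factor of
    0.9" means the remaining gap (1 - beta) is multiplied by 0.9 at each
    target-network update; the paper does not give the exact rule."""
    beta = 0.0
    for _ in range(num_target_updates):
        yield beta
        beta = 1.0 - 0.9 * (1.0 - beta)

# beta after the first few target-network updates:
for i, beta in enumerate(beta_schedule(5)):
    print(f"target update {i}: beta = {beta:.4f}")
```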