Scalable Methods for Computing State Similarity in Deterministic Markov Decision Processes
Authors: Pablo Samuel Castro
AAAI 2020, pp. 10069-10076 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we provide empirical evidence for the effectiveness of our bisimulation approximants. We begin with a simple 31-state Grid World, on which we can compute the bisimulation metric exactly, and use a noisy representation which yields a continuous-state MDP. Having the exact metric for the 31-state MDP allows us to quantitatively measure the quality of our learned approximant. We then learn a π-bisimulation approximant over policies generated by reinforcement learning agents trained on Atari 2600 games. |
| Researcher Affiliation | Industry | Pablo Samuel Castro Google Brain psc@google.com |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code available at https://github.com/google-research/google-research/tree/master/bisimulation_aaai2020 |
| Open Datasets | Yes | We begin with a simple 31-state Grid World... We then learn a π-bisimulation approximant over policies generated by reinforcement learning agents trained on Atari 2600 games. ... Arcade Learning Environment (Bellemare et al. 2013). |
| Dataset Splits | No | The paper does not provide explicit train/validation/test dataset splits (e.g., percentages or sample counts). It describes the training process and how data is sampled from a replay buffer, but not a fixed partition. |
| Hardware Specification | Yes | Training was done on a Tesla P100 GPU. |
| Software Dependencies | No | The paper mentions "Adam optimizer (Kingma and Ba 2015)" but does not provide version numbers for any software dependencies or libraries used for implementation. |
| Experiment Setup | Yes | We ran our experiments with γ = 0.99, C = 500, b = 256, and increased β from 0 to 1 by a factor of 0.9 every time the target network was updated; we used the Adam optimizer (Kingma and Ba 2015) with a learning rate of 0.01. ... We used the Adam optimizer (Kingma and Ba 2015) with a learning rate of 7.5e-5 (except for Pong where we found 0.001 yielded better results). |
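
The Research Type row above quotes the paper's evaluation: the exact bisimulation metric is computed on a 31-state Grid World and used as ground truth for the learned approximant. As a point of reference, the sketch below is not the paper's released code; the toy MDP, function names, and state/action counts are illustrative. It iterates the deterministic bisimulation update d(s, t) ← max_a [ |R(s, a) − R(t, a)| + γ·d(next(s, a), next(t, a)) ], the simplification available when transitions are deterministic, to its fixed point with NumPy.

```python
"""Minimal sketch: exact bisimulation metric for a small deterministic MDP.

Iterates the deterministic update
    d(s, t) <- max_a [ |R(s, a) - R(t, a)| + gamma * d(next(s, a), next(t, a)) ]
until convergence. The toy MDP below is illustrative, not from the paper.
"""
import numpy as np


def exact_bisimulation_metric(rewards, next_state, gamma=0.99,
                              tol=1e-8, max_iters=10_000):
    """Fixed-point iteration of the deterministic bisimulation update.

    Args:
        rewards: float array of shape (num_states, num_actions), R(s, a).
        next_state: int array of shape (num_states, num_actions), T(s, a).
        gamma: discount factor weighting the recursive term.
        tol: stop when the sup-norm change between sweeps falls below this.
        max_iters: safety cap on the number of sweeps.

    Returns:
        d: array of shape (num_states, num_states) with the metric d(s, t).
    """
    n, _ = rewards.shape
    d = np.zeros((n, n))
    for _ in range(max_iters):
        # |R(s, a) - R(t, a)| for every state pair and action: shape (n, n, A).
        reward_diff = np.abs(rewards[:, None, :] - rewards[None, :, :])
        # d(next(s, a), next(t, a)) for every state pair and action.
        next_diff = d[next_state[:, None, :], next_state[None, :, :]]
        d_new = np.max(reward_diff + gamma * next_diff, axis=-1)
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d


if __name__ == "__main__":
    # Tiny 3-state, 2-action deterministic MDP (illustrative only).
    rewards = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
    next_state = np.array([[1, 2], [1, 2], [0, 0]])
    d = exact_bisimulation_metric(rewards, next_state)
    print(np.round(d, 3))  # states 0 and 1 are identical, so d(0, 1) = 0.
```

For the paper's 31-state Grid World, `rewards` and `next_state` would be built from the grid dynamics; the repository linked in the Open Source Code row contains the authors' actual implementation.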
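The Experiment Setup row quotes the hyperparameters used to train the learned approximants. The sketch below collects them into a single config for readability; the dataclass, field names, and helper are illustrative rather than taken from the paper or its code, and the reading of the β schedule (the gap 1 − β shrinking by a factor of 0.9 at each target-network update) is an assumption.

```python
"""Minimal sketch of the quoted hyperparameters; names are illustrative."""
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    gamma: float = 0.99                  # discount factor
    target_update_period: int = 500      # C: steps between target-network syncs
    batch_size: int = 256                # b: samples per gradient update
    grid_world_learning_rate: float = 0.01    # Adam, Grid World runs
    atari_learning_rate: float = 7.5e-5       # Adam, Atari runs (0.001 for Pong)
    beta_decay: float = 0.9              # assumed per-update shrink of (1 - beta)


def beta_schedule(num_target_updates: int, decay: float = 0.9) -> float:
    """Assumed schedule: beta climbs from 0 toward 1 as 1 - decay**k."""
    return 1.0 - decay ** num_target_updates
```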