Scalable Methods for Computing State Similarity in Deterministic Markov Decision Processes
Authors: Pablo Samuel Castro
AAAI 2020, pp. 10069-10076 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we provide empirical evidence for the effectiveness of our bisimulation approximants. We begin with a simple 31-state Grid World, on which we can compute the bisimulation metric exactly, and use a noisy representation which yields a continuous-state MDP. Having the exact metric for the 31-state MDP allows us to quantitatively measure the quality of our learned approximant. We then learn a π-bisimulation approximant over policies generated by reinforcement learning agents trained on Atari 2600 games. |
| Researcher Affiliation | Industry | Pablo Samuel Castro Google Brain psc@google.com |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code available at https://github.com/google-research/google-research/tree/master/bisimulation_aaai2020 |
| Open Datasets | Yes | We begin with a simple 31-state Grid World... We then learn a π-bisimulation approximant over policies generated by reinforcement learning agents trained on Atari 2600 games. ... Arcade Learning Environment (Bellemare et al. 2013). |
| Dataset Splits | No | The paper does not provide explicit train/validation/test dataset splits (e.g., percentages or sample counts). It describes the training process and how data is sampled from a replay buffer, but not a fixed partition. |
| Hardware Specification | Yes | Training was done on a Tesla P100 GPU. |
| Software Dependencies | No | The paper mentions "Adam optimizer (Kingma and Ba 2015)" but does not provide version numbers for any software dependencies or libraries used for implementation. |
| Experiment Setup | Yes | We ran our experiments with γ = 0.99, C = 500, b = 256, and increased β from 0 to 1 by a factor of 0.9 every time the target network was updated; we used the Adam optimizer (Kingma and Ba 2015) with a learning rate of 0.01. ... We used the Adam optimizer (Kingma and Ba 2015) with a learning rate of 7.5e-5 (except for Pong where we found 0.001 yielded better results). |
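
The Research Type row above quotes the paper's evaluation: the exact bisimulation metric is computed on a 31-state Grid World and used as ground truth for the learned approximant. As a point of reference, the sketch below is not the paper's released code; the toy MDP, function names, and state/action counts are illustrative. It iterates the deterministic bisimulation update d(s, t) ← max_a [ |R(s, a) − R(t, a)| + γ·d(next(s, a), next(t, a)) ], the simplification available when transitions are deterministic, to its fixed point with NumPy.

```python
"""Minimal sketch: exact bisimulation metric for a small deterministic MDP.

Iterates the deterministic update
    d(s, t) <- max_a [ |R(s, a) - R(t, a)| + gamma * d(next(s, a), next(t, a)) ]
until convergence. The toy MDP below is illustrative, not from the paper.
"""
import numpy as np


def exact_bisimulation_metric(rewards, next_state, gamma=0.99,
                              tol=1e-8, max_iters=10_000):
    """Fixed-point iteration of the deterministic bisimulation update.

    Args:
        rewards: float array of shape (num_states, num_actions), R(s, a).
        next_state: int array of shape (num_states, num_actions), T(s, a).
        gamma: discount factor weighting the recursive term.
        tol: stop when the sup-norm change between sweeps falls below this.
        max_iters: safety cap on the number of sweeps.

    Returns:
        d: array of shape (num_states, num_states) with the metric d(s, t).
    """
    n, _ = rewards.shape
    d = np.zeros((n, n))
    for _ in range(max_iters):
        # |R(s, a) - R(t, a)| for every state pair and action: shape (n, n, A).
        reward_diff = np.abs(rewards[:, None, :] - rewards[None, :, :])
        # d(next(s, a), next(t, a)) for every state pair and action.
        next_diff = d[next_state[:, None, :], next_state[None, :, :]]
        d_new = np.max(reward_diff + gamma * next_diff, axis=-1)
        if np.max(np.abs(d_new - d)) < tol:
            return d_new
        d = d_new
    return d


if __name__ == "__main__":
    # Tiny 3-state, 2-action deterministic MDP (illustrative only).
    rewards = np.array([[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
    next_state = np.array([[1, 2], [1, 2], [0, 0]])
    d = exact_bisimulation_metric(rewards, next_state)
    print(np.round(d, 3))  # states 0 and 1 are identical, so d(0, 1) = 0.
```

For the paper's 31-state Grid World, `rewards` and `next_state` would be built from the grid dynamics; the repository linked in the Open Source Code row contains the authors' actual implementation.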
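The Experiment Setup row quotes the hyperparameters used to train the learned approximants. The sketch below collects them into a single config for readability; the dataclass, field names, and helper are illustrative rather than taken from the paper or its code, and the reading of the β schedule (the gap 1 − β shrinking by a factor of 0.9 at each target-network update) is an assumption.

```python
"""Minimal sketch of the quoted hyperparameters; names are illustrative."""
from dataclasses import dataclass


@dataclass
class TrainingConfig:
    gamma: float = 0.99                  # discount factor
    target_update_period: int = 500      # C: steps between target-network syncs
    batch_size: int = 256                # b: samples per gradient update
    grid_world_learning_rate: float = 0.01    # Adam, Grid World runs
    atari_learning_rate: float = 7.5e-5       # Adam, Atari runs (0.001 for Pong)
    beta_decay: float = 0.9              # assumed per-update shrink of (1 - beta)


def beta_schedule(num_target_updates: int, decay: float = 0.9) -> float:
    """Assumed schedule: beta climbs from 0 toward 1 as 1 - decay**k."""
    return 1.0 - decay ** num_target_updates
```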