Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Scalable Methods for Computing State Similarity in Deterministic Markov Decision Processes
Authors: Pablo Samuel Castro | Pages: 10069-10076
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section we provide empirical evidence for the effectiveness of our bisimulation approximants. We begin with a simple 31-state Grid World, on which we can compute the bisimulation metric exactly, and use a noisy representation which yields a continuous-state MDP. Having the exact metric for the 31-state MDP allows us to quantitatively measure the quality of our learned approximant. We then learn a π-bisimulation approximant over policies generated by reinforcement learning agents trained on Atari 2600 games. |
| Researcher Affiliation | Industry | Pablo Samuel Castro Google Brain EMAIL |
| Pseudocode | No | No explicit pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | Code available at https://github.com/google-research/google-research/tree/master/bisimulation_aaai2020 |
| Open Datasets | Yes | We begin with a simple 31-state Grid World... We then learn a π-bisimulation approximant over policies generated by reinforcement learning agents trained on Atari 2600 games. ... Arcade Learning Environment (Bellemare et al. 2013). |
| Dataset Splits | No | The paper does not provide explicit train/validation/test dataset splits (e.g., percentages or sample counts). It describes the training process and how data is sampled from a replay buffer, but not a fixed partition. |
| Hardware Specification | Yes | Training was done on a Tesla P100 GPU. |
| Software Dependencies | No | The paper mentions "Adam optimizer (Kingma and Ba 2015)" but does not provide version numbers for any software dependencies or libraries used for implementation. |
| Experiment Setup | Yes | We ran our experiments with γ = 0.99, C = 500, b = 256, and increased β from 0 to 1 by a factor of 0.9 every time the target network was updated; we used the Adam optimizer (Kingma and Ba 2015) with a learning rate of 0.01. ... We used the Adam optimizer (Kingma and Ba 2015) with a learning rate of 7.5e-5 (except for Pong where we found 0.001 yielded better results). |
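
The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is an illustration only, not the author's code: the names `QuotedSetup`, `beta_after`, and `second_setup_learning_rate` are hypothetical, the meanings assigned to C (target-network update period) and b (batch size) are assumptions because the quote gives only the symbols, and the β schedule shown (β = 1 − 0.9^k after k target-network updates) is just one possible reading of "increased β from 0 to 1 by a factor of 0.9". The second learning rate is read as 7.5e-5.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuotedSetup:
    """Values from the first quoted sentence of the Experiment Setup row.

    The interpretations of C and b are assumptions, not stated in the quote.
    """
    gamma: float = 0.99               # discount factor (γ)
    target_update_period: int = 500   # C, assumed: steps between target-network updates
    batch_size: int = 256             # b, assumed: minibatch size
    learning_rate: float = 0.01       # Adam learning rate
    beta_factor: float = 0.9          # the "factor of 0.9" applied to β


def beta_after(num_target_updates: int, factor: float = 0.9) -> float:
    """Assumed schedule: β starts at 0 and approaches 1 as 1 - factor**k."""
    if num_target_updates < 0:
        raise ValueError("num_target_updates must be non-negative")
    return 1.0 - factor ** num_target_updates


def second_setup_learning_rate(game: str) -> float:
    """Adam learning rate from the second quoted sentence (0.001 for Pong)."""
    return 1e-3 if game == "Pong" else 7.5e-5


if __name__ == "__main__":
    setup = QuotedSetup()
    print(setup)
    # Print β under the assumed schedule for the first few target updates.
    for k in range(5):
        print(k, beta_after(k, setup.beta_factor))
```

Under this reading β moves monotonically from 0 toward 1 without ever reaching it exactly; a linear or clipped schedule would be an equally plausible interpretation of the quoted sentence, so the repository linked in the Open Source Code row is the place to check.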