On the Expressivity of Markov Reward

Authors: David Abel, Will Dabney, Anna Harutyunyan, Mark K. Ho, Michael Littman, Doina Precup, Satinder Singh

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conclude with an empirical study that corroborates and illustrates our theoretical findings. Our focus is on SOAPs, though we anticipate the insights extend to POs and TOs as well with little complication. In the first experiment, we study the fraction of SOAPs that are expressible in small CMPs as we vary aspects of the environment or task (Figure 3). In the second, we use one algorithm from Theorem 4.3 to design a reward function, and contrast learning curves under a SOAP-designed reward function compared to standard rewards. Full details about the experiments are found in Appendix C. (Illustrative sketches of both experiments appear after this table.)
Researcher Affiliation | Collaboration | David Abel (DeepMind, dmabel@deepmind.com); Will Dabney (DeepMind, wdabney@deepmind.com); Anna Harutyunyan (DeepMind, harutyunyan@deepmind.com); Mark K. Ho (Department of Computer Science, Princeton University, mho@princeton.edu); Michael L. Littman (Department of Computer Science, Brown University, mlittman@cs.brown.edu); Doina Precup (DeepMind, doinap@deepmind.com); Satinder Singh (DeepMind, baveja@deepmind.com)
Pseudocode | Yes | Algorithm 1: SOAP Reward Design. (A hedged sketch of one way such a design step can be posed as a linear program follows this table.)
Open Source Code | No | The paper provides no concrete access information (link or explicit release statement) for open-source code implementing the described methodology.
Open Datasets | No | The paper describes experiments on a Russell and Norvig [42] grid world and on small CMPs (Controlled Markov Processes), which are simulated environments or generated scenarios rather than publicly available datasets with a fixed download source. No link, DOI, repository, or formal citation for a public dataset is provided.
Dataset Splits | No | The paper does not specify dataset split percentages or sample counts for training, validation, or test sets. It describes simulating environments and running Q-learning, not training on predefined data splits.
Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory, cloud instances) used for running its experiments. It only discusses the experimental setup at a conceptual level.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | No | The paper describes aspects of the experimental setup, such as the 0.35 slip probability and reward values in the grid world, and that each CMP has four states and three actions with a fixed but randomly chosen transition function. However, it does not provide specific hyperparameter values such as learning rates or exploration settings, which would be needed for a fully specified experimental setup. (An illustrative tabular Q-learning sketch with explicitly assumed hyperparameters appears after this table.)
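
The paper's Algorithm 1 (SOAP Reward Design) is not reproduced in this report, but its core idea can be sketched. A SOAP (set of acceptable policies) is realized by a Markov reward function when every acceptable policy obtains strictly higher value than every unacceptable one; because start-state value is linear in the reward through the discounted state-action occupancy measure, realizability can be checked, and a realizing reward recovered, with a linear program. The sketch below is a minimal illustration under that reading, not the authors' implementation: the function names, the start-state-value criterion, the margin `eps`, and the reward bound `r_max` are all assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): check SOAP realizability
# and recover a realizing reward with a linear program over the reward table.
import itertools
import numpy as np
from scipy.optimize import linprog


def occupancy(P, policy, gamma, start):
    """Discounted state-action occupancy of a deterministic policy from the start state."""
    n_states, n_actions, _ = P.shape
    P_pi = np.array([P[s, policy[s]] for s in range(n_states)])  # (S, S) transitions under pi
    mu0 = np.zeros(n_states)
    mu0[start] = 1.0
    # State occupancies d = mu0^T (I - gamma * P_pi)^(-1), solved as a linear system.
    d_states = np.linalg.solve((np.eye(n_states) - gamma * P_pi).T, mu0)
    d = np.zeros((n_states, n_actions))
    for s in range(n_states):
        d[s, policy[s]] = d_states[s]
    return d.ravel()


def design_soap_reward(P, good_policies, gamma=0.9, start=0, eps=1e-3, r_max=1.0):
    """Return a reward table realizing the SOAP `good_policies`, or None if none exists."""
    n_states, n_actions, _ = P.shape
    all_policies = list(itertools.product(range(n_actions), repeat=n_states))
    good = [p for p in all_policies if p in good_policies]
    bad = [p for p in all_policies if p not in good_policies]
    if not good or not bad:
        return np.zeros((n_states, n_actions))  # degenerate SOAP: nothing to separate
    occ = {p: occupancy(P, p, gamma, start) for p in all_policies}
    # Require V_good(s0) >= V_bad(s0) + eps for every acceptable/unacceptable pair,
    # i.e. (d_bad - d_good) . r <= -eps, which is linear in the reward vector r.
    A_ub = np.array([occ[b] - occ[g] for g in good for b in bad])
    b_ub = -eps * np.ones(len(A_ub))
    res = linprog(c=np.zeros(n_states * n_actions), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(-r_max, r_max)] * (n_states * n_actions), method="highs")
    return res.x.reshape(n_states, n_actions) if res.success else None
```

Here `P` is a states × actions × states transition array and `good_policies` is a set of deterministic policies, each written as a tuple mapping state index to action index.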
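
Building on the sketch above, the first experiment's headline quantity, the fraction of SOAPs expressible in small CMPs, could be estimated roughly as follows. The sampling scheme, sample counts, and seed are illustrative assumptions rather than the paper's settings, and the helper reuses `design_soap_reward` from the previous sketch.

```python
# Rough sketch of estimating the fraction of sampled SOAPs that some Markov
# reward realizes in randomly generated small CMPs (sampling details assumed).
import itertools
import numpy as np


def fraction_expressible(n_states=4, n_actions=3, n_cmps=50, n_soaps=20,
                         gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    all_policies = list(itertools.product(range(n_actions), repeat=n_states))
    hits, total = 0, 0
    for _ in range(n_cmps):
        # A fixed but randomly chosen transition function, as described for the experiments.
        P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
        for _ in range(n_soaps):
            # Sample a SOAP: exhaustively enumerating all subsets of the 81
            # deterministic policies is infeasible, so sample acceptable sets.
            k = int(rng.integers(1, len(all_policies)))
            chosen = rng.choice(len(all_policies), size=k, replace=False)
            good = {all_policies[i] for i in chosen}
            total += 1
            hits += design_soap_reward(P, good, gamma=gamma) is not None
    return hits / total
```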
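
For the second experiment, learning curves under a SOAP-designed reward versus a standard reward, the paper does not report learning rates or exploration settings, so the tabular Q-learning sketch below fills them in with explicitly assumed values (`alpha`, `epsilon`, episode count, horizon); none of these come from the paper.

```python
# Minimal tabular Q-learning sketch; all hyperparameter values are assumptions.
import numpy as np


def q_learning(P, R, gamma=0.9, alpha=0.1, epsilon=0.1,
               episodes=500, horizon=50, start=0, seed=0):
    """Epsilon-greedy tabular Q-learning in a CMP with transition table P and reward table R."""
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next = int(rng.choice(n_states, p=P[s, a]))
            # Standard Q-learning update toward the bootstrapped target.
            Q[s, a] += alpha * (R[s, a] + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```

A comparison along the lines of the second experiment would then run `q_learning(P, R_designed)` and `q_learning(P, R_standard)` on the same CMP, with `R_designed` produced by the design sketch above and `R_standard` a hand-specified baseline reward.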