On the Expressivity of Markov Reward

Authors: David Abel, Will Dabney, Anna Harutyunyan, Mark K. Ho, Michael Littman, Doina Precup, Satinder Singh

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conclude with an empirical study that corroborates and illustrates our theoretical findings. Our focus is on SOAPs, though we anticipate the insights extend to POs and TOs as well with little complication. In the first experiment, we study the fraction of SOAPs that are expressible in small CMPs as we vary aspects of the environment or task (Figure 3). In the second, we use one algorithm from Theorem 4.3 to design a reward function, and contrast learning curves under a SOAP-designed reward function compared to standard rewards. Full details about the experiments are found in Appendix C. (Illustrative sketches of both experiments appear after this table.)
Researcher Affiliation | Collaboration | David Abel (DeepMind, dmabel@deepmind.com); Will Dabney (DeepMind, wdabney@deepmind.com); Anna Harutyunyan (DeepMind, harutyunyan@deepmind.com); Mark K. Ho (Department of Computer Science, Princeton University, mho@princeton.edu); Michael L. Littman (Department of Computer Science, Brown University, mlittman@cs.brown.edu); Doina Precup (DeepMind, doinap@deepmind.com); Satinder Singh (DeepMind, baveja@deepmind.com)
Pseudocode | Yes | Algorithm 1: SOAP Reward Design. (A hedged sketch of one way such a design step can be posed as a linear program follows this table.)
Open Source Code | No | The paper provides no concrete access information (link or explicit release statement) for open-source code implementing the described methodology.
Open Datasets | No | The paper describes experiments on a Russell and Norvig [42] grid world and on small CMPs (Controlled Markov Processes), which are simulated environments or generated scenarios rather than publicly available datasets with a fixed download source. No link, DOI, repository, or formal citation for a public dataset is provided.
Dataset Splits | No | The paper does not specify dataset split percentages or sample counts for training, validation, or test sets. It describes simulating environments and running Q-learning, not training on predefined data splits.
Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU/CPU models, memory, cloud instances) used for running its experiments. It only discusses the experimental setup at a conceptual level.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library names with version numbers (e.g., Python 3.8, PyTorch 1.9).
Experiment Setup | No | The paper describes aspects of the experimental setup, such as the 0.35 slip probability and reward values in the grid world, and that each CMP has four states and three actions with a fixed but randomly chosen transition function. However, it does not provide specific hyperparameter values such as learning rates or exploration settings, which would be needed for a fully specified experimental setup. (An illustrative tabular Q-learning sketch with explicitly assumed hyperparameters appears after this table.)
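
The paper's Algorithm 1 (SOAP Reward Design) is not reproduced in this report, but its core idea can be sketched. A SOAP (set of acceptable policies) is realized by a Markov reward function when every acceptable policy obtains strictly higher value than every unacceptable one; because start-state value is linear in the reward through the discounted state-action occupancy measure, realizability can be checked, and a realizing reward recovered, with a linear program. The sketch below is a minimal illustration under that reading, not the authors' implementation: the function names, the start-state-value criterion, the margin `eps`, and the reward bound `r_max` are all assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): check SOAP realizability
# and recover a realizing reward with a linear program over the reward table.
import itertools
import numpy as np
from scipy.optimize import linprog


def occupancy(P, policy, gamma, start):
    """Discounted state-action occupancy of a deterministic policy from the start state."""
    n_states, n_actions, _ = P.shape
    P_pi = np.array([P[s, policy[s]] for s in range(n_states)])  # (S, S) transitions under pi
    mu0 = np.zeros(n_states)
    mu0[start] = 1.0
    # State occupancies d = mu0^T (I - gamma * P_pi)^(-1), solved as a linear system.
    d_states = np.linalg.solve((np.eye(n_states) - gamma * P_pi).T, mu0)
    d = np.zeros((n_states, n_actions))
    for s in range(n_states):
        d[s, policy[s]] = d_states[s]
    return d.ravel()


def design_soap_reward(P, good_policies, gamma=0.9, start=0, eps=1e-3, r_max=1.0):
    """Return a reward table realizing the SOAP `good_policies`, or None if none exists."""
    n_states, n_actions, _ = P.shape
    all_policies = list(itertools.product(range(n_actions), repeat=n_states))
    good = [p for p in all_policies if p in good_policies]
    bad = [p for p in all_policies if p not in good_policies]
    if not good or not bad:
        return np.zeros((n_states, n_actions))  # degenerate SOAP: nothing to separate
    occ = {p: occupancy(P, p, gamma, start) for p in all_policies}
    # Require V_good(s0) >= V_bad(s0) + eps for every acceptable/unacceptable pair,
    # i.e. (d_bad - d_good) . r <= -eps, which is linear in the reward vector r.
    A_ub = np.array([occ[b] - occ[g] for g in good for b in bad])
    b_ub = -eps * np.ones(len(A_ub))
    res = linprog(c=np.zeros(n_states * n_actions), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(-r_max, r_max)] * (n_states * n_actions), method="highs")
    return res.x.reshape(n_states, n_actions) if res.success else None
```

Here `P` is a states × actions × states transition array and `good_policies` is a set of deterministic policies, each written as a tuple mapping state index to action index.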
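
Building on the sketch above, the first experiment's headline quantity, the fraction of SOAPs expressible in small CMPs, could be estimated roughly as follows. The sampling scheme, sample counts, and seed are illustrative assumptions rather than the paper's settings, and the helper reuses `design_soap_reward` from the previous sketch.

```python
# Rough sketch of estimating the fraction of sampled SOAPs that some Markov
# reward realizes in randomly generated small CMPs (sampling details assumed).
import itertools
import numpy as np


def fraction_expressible(n_states=4, n_actions=3, n_cmps=50, n_soaps=20,
                         gamma=0.9, seed=0):
    rng = np.random.default_rng(seed)
    all_policies = list(itertools.product(range(n_actions), repeat=n_states))
    hits, total = 0, 0
    for _ in range(n_cmps):
        # A fixed but randomly chosen transition function, as described for the experiments.
        P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
        for _ in range(n_soaps):
            # Sample a SOAP: exhaustively enumerating all subsets of the 81
            # deterministic policies is infeasible, so sample acceptable sets.
            k = int(rng.integers(1, len(all_policies)))
            chosen = rng.choice(len(all_policies), size=k, replace=False)
            good = {all_policies[i] for i in chosen}
            total += 1
            hits += design_soap_reward(P, good, gamma=gamma) is not None
    return hits / total
```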
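
For the second experiment, learning curves under a SOAP-designed reward versus a standard reward, the paper does not report learning rates or exploration settings, so the tabular Q-learning sketch below fills them in with explicitly assumed values (`alpha`, `epsilon`, episode count, horizon); none of these come from the paper.

```python
# Minimal tabular Q-learning sketch; all hyperparameter values are assumptions.
import numpy as np


def q_learning(P, R, gamma=0.9, alpha=0.1, epsilon=0.1,
               episodes=500, horizon=50, start=0, seed=0):
    """Epsilon-greedy tabular Q-learning in a CMP with transition table P and reward table R."""
    rng = np.random.default_rng(seed)
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = start
        for _ in range(horizon):
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next = int(rng.choice(n_states, p=P[s, a]))
            # Standard Q-learning update toward the bootstrapped target.
            Q[s, a] += alpha * (R[s, a] + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q
```

A comparison along the lines of the second experiment would then run `q_learning(P, R_designed)` and `q_learning(P, R_standard)` on the same CMP, with `R_designed` produced by the design sketch above and `R_standard` a hand-specified baseline reward.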