Hardness in Markov Decision Processes: Theory and Practice
Authors: Michelangelo Conserva, Paulo Rauber
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Third, we present an empirical analysis that provides new insights into computable measures. Finally, we benchmark five tabular agents in our newly proposed benchmark. |
| Researcher Affiliation | Academia | Michelangelo Conserva, Queen Mary University of London, London, United Kingdom (m.conserva@qmul.ac.uk); Paulo Rauber, Queen Mary University of London, London, United Kingdom (p.rauber@qmul.ac.uk) |
| Pseudocode | No | The paper describes methods and processes but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures with structured, code-like steps. |
| Open Source Code | Yes | This section briefly introduces Colosseum, a pioneering Python package that bridges theory and practice in tabular reinforcement learning while also being applicable in the non-tabular setting. More details about the package can be found in Appendix A and in the project website.2 (Footnote 2: Available at https://michelangeloconserva.github.io/Colosseum.) |
| Open Datasets | Yes | Eight MDP families are available for experimentation. Some are traditional families (River Swim [24], Taxi [25], and Frozen Lake) while others are more recent (MiniGrid environments [26]). Additionally, Deep Sea [27] was included as a hard exploration family of problems, and the Simple Grid family is composed of simplified versions of the MiniGrid Empty environment. |
| Dataset Splits | No | The paper states: "The agents' hyperparameters have been chosen by random search to minimize the average regret across MDPs with randomly sampled parameters (see Appendix E)." This describes a method for hyperparameter tuning, which serves a validation purpose. However, it does not specify explicit dataset splits (e.g., percentages or sample counts) for training, validation, or testing, as typically seen with static datasets in supervised learning. The context is reinforcement learning, where data is generated through interaction. |
| Hardware Specification | No | This research was financially supported by the Intelligent Games and Games Intelligence CDT (IGGI; EP/S022325/1) and used Queen Mary University of London Apocrita HPC facility. While a facility is named, no specific hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or memory specifications are provided. |
| Software Dependencies | No | The paper mentions 'Colosseum, a pioneering Python package' and states that 'More details about the package can be found in Appendix A'. However, Appendix A is not provided in the given text, and therefore, specific software dependencies with version numbers (e.g., 'Python 3.8, PyTorch 1.9') are not explicitly listed within the available content. |
| Experiment Setup | Yes | We set the total number of time steps to 500 000 with a maximum training time of 10 minutes for the tabular setting and 40 minutes for the non-tabular setting. If an agent does not reach the maximum number of time steps before this time limit, learning is interrupted, and the agent continues interacting using its last best policy. This guarantees a fair comparison between agents with different computational costs. The performance indicators are computed every 100 time steps. Each interaction between an agent and an MDP is repeated for 20 seeds. The agents' hyperparameters have been chosen by random search to minimize the average regret across MDPs with randomly sampled parameters (see Appendix E). |
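
The protocol quoted in the Experiment Setup row can be summarised as a simple interaction loop. The sketch below is illustrative only: `Agent`, `MDP`, and all of their methods (`seed`, `select_action`, `update`, `freeze_to_best_policy`, `compute_performance_indicators`) are hypothetical stand-ins, not the Colosseum API; only the numeric budgets (500 000 steps, a 10-minute wall-clock limit in the tabular setting, indicators every 100 steps, 20 seeds) come from the paper.

```python
# Minimal sketch of the evaluation protocol described in the Experiment Setup row.
# All agent/MDP methods below are hypothetical stand-ins, NOT the Colosseum API.
import time

TOTAL_TIME_STEPS = 500_000      # total interaction budget per agent/MDP pair
MAX_TRAINING_SECONDS = 10 * 60  # 10 minutes (tabular); 40 minutes in the non-tabular setting
LOG_EVERY = 100                 # performance indicators computed every 100 time steps
N_SEEDS = 20                    # each interaction is repeated for 20 seeds


def run_single_seed(agent, mdp, seed):
    """Run one agent/MDP interaction under the step and wall-clock budgets."""
    agent.seed(seed)
    mdp.seed(seed)
    state = mdp.reset()
    start, training_stopped = time.monotonic(), False
    indicators = []

    for t in range(1, TOTAL_TIME_STEPS + 1):
        # Once the wall-clock budget is spent, learning is interrupted and the agent
        # keeps acting with its last best policy, which makes the comparison fair
        # across agents with different computational costs.
        if not training_stopped and time.monotonic() - start > MAX_TRAINING_SECONDS:
            agent.freeze_to_best_policy()
            training_stopped = True

        action = agent.select_action(state)
        next_state, reward = mdp.step(action)
        if not training_stopped:
            agent.update(state, action, reward, next_state)
        state = next_state

        if t % LOG_EVERY == 0:
            indicators.append(mdp.compute_performance_indicators())
    return indicators


def run_benchmark(agent_factory, mdp_factory):
    """Repeat the interaction for N_SEEDS seeds, as in the paper's protocol."""
    return [run_single_seed(agent_factory(), mdp_factory(), seed) for seed in range(N_SEEDS)]
```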
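
The hyperparameter selection mentioned in the Dataset Splits and Experiment Setup rows (random search minimizing average regret across MDPs with randomly sampled parameters) follows a standard pattern; the sketch below assumes a caller-supplied `evaluate(config)` callable that returns the average cumulative regret of an agent run with `config`, and is not the implementation from the paper or the package.

```python
# Minimal sketch of hyperparameter selection by random search, assuming a
# hypothetical `evaluate(config)` callable supplied by the caller.
import random


def sample_hyperparameters(search_space, rng):
    """Draw one configuration, e.g. {'lr': (1e-4, 1e-1)} -> {'lr': 0.03}."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in search_space.items()}


def random_search(evaluate, search_space, n_trials=50, seed=0):
    """Return the configuration with the lowest average regret over n_trials draws."""
    rng = random.Random(seed)
    best_config, best_regret = None, float("inf")
    for _ in range(n_trials):
        config = sample_hyperparameters(search_space, rng)
        regret = evaluate(config)  # average regret across randomly sampled MDPs
        if regret < best_regret:
            best_config, best_regret = config, regret
    return best_config
```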