Hardness in Markov Decision Processes: Theory and Practice
Authors: Michelangelo Conserva, Paulo Rauber
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Third, we present an empirical analysis that provides new insights into computable measures. Finally, we benchmark five tabular agents in our newly proposed benchmark. |
| Researcher Affiliation | Academia | Michelangelo Conserva, Queen Mary University of London, London, United Kingdom (m.conserva@qmul.ac.uk); Paulo Rauber, Queen Mary University of London, London, United Kingdom (p.rauber@qmul.ac.uk) |
| Pseudocode | No | The paper describes methods and processes but does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks or figures with structured, code-like steps. |
| Open Source Code | Yes | This section briefly introduces Colosseum, a pioneering Python package that bridges theory and practice in tabular reinforcement learning while also being applicable in the non-tabular setting. More details about the package can be found in Appendix A and in the project website.2 (Footnote 2: Available at https://michelangeloconserva.github.io/Colosseum.) |
| Open Datasets | Yes | Eight MDP families are available for experimentation. Some are traditional families (River Swim [24], Taxi [25], and Frozen Lake) while others are more recent (MiniGrid environments [26]). Additionally, Deep Sea [27] was included as a hard exploration family of problems, and the Simple Grid family is composed of simplified versions of the MiniGrid Empty environment. |
| Dataset Splits | No | The paper states: "The agents' hyperparameters have been chosen by random search to minimize the average regret across MDPs with randomly sampled parameters (see Appendix E)." This describes a method for hyperparameter tuning, which serves a validation purpose. However, it does not specify explicit dataset splits (e.g., percentages or sample counts) for training, validation, or testing, as typically seen with static datasets in supervised learning. The context is reinforcement learning, where data is generated through interaction. |
| Hardware Specification | No | This research was financially supported by the Intelligent Games and Games Intelligence CDT (IGGI; EP/S022325/1) and used Queen Mary University of London Apocrita HPC facility. While a facility is named, no specific hardware details such as GPU models (e.g., NVIDIA A100), CPU models, or memory specifications are provided. |
| Software Dependencies | No | The paper mentions 'Colosseum, a pioneering Python package' and states that 'More details about the package can be found in Appendix A'. However, Appendix A is not provided in the given text, and therefore, specific software dependencies with version numbers (e.g., 'Python 3.8, PyTorch 1.9') are not explicitly listed within the available content. |
| Experiment Setup | Yes | We set the total number of time steps to 500 000 with a maximum training time of 10 minutes for the tabular setting and 40 minutes for the non-tabular setting. If an agent does not reach the maximum number of time steps before this time limit, learning is interrupted, and the agent continues interacting using its last best policy. This guarantees a fair comparison between agents with different computational costs. The performance indicators are computed every 100 time steps. Each interaction between an agent and an MDP is repeated for 20 seeds. The agents' hyperparameters have been chosen by random search to minimize the average regret across MDPs with randomly sampled parameters (see Appendix E). |
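
The protocol quoted in the Experiment Setup row can be summarised as a simple interaction loop. The sketch below is illustrative only: `Agent`, `MDP`, and all of their methods (`seed`, `select_action`, `update`, `freeze_to_best_policy`, `compute_performance_indicators`) are hypothetical stand-ins, not the Colosseum API; only the numeric budgets (500 000 steps, a 10-minute wall-clock limit in the tabular setting, indicators every 100 steps, 20 seeds) come from the paper.

```python
# Minimal sketch of the evaluation protocol described in the Experiment Setup row.
# All agent/MDP methods below are hypothetical stand-ins, NOT the Colosseum API.
import time

TOTAL_TIME_STEPS = 500_000      # total interaction budget per agent/MDP pair
MAX_TRAINING_SECONDS = 10 * 60  # 10 minutes (tabular); 40 minutes in the non-tabular setting
LOG_EVERY = 100                 # performance indicators computed every 100 time steps
N_SEEDS = 20                    # each interaction is repeated for 20 seeds


def run_single_seed(agent, mdp, seed):
    """Run one agent/MDP interaction under the step and wall-clock budgets."""
    agent.seed(seed)
    mdp.seed(seed)
    state = mdp.reset()
    start, training_stopped = time.monotonic(), False
    indicators = []

    for t in range(1, TOTAL_TIME_STEPS + 1):
        # Once the wall-clock budget is spent, learning is interrupted and the agent
        # keeps acting with its last best policy, which makes the comparison fair
        # across agents with different computational costs.
        if not training_stopped and time.monotonic() - start > MAX_TRAINING_SECONDS:
            agent.freeze_to_best_policy()
            training_stopped = True

        action = agent.select_action(state)
        next_state, reward = mdp.step(action)
        if not training_stopped:
            agent.update(state, action, reward, next_state)
        state = next_state

        if t % LOG_EVERY == 0:
            indicators.append(mdp.compute_performance_indicators())
    return indicators


def run_benchmark(agent_factory, mdp_factory):
    """Repeat the interaction for N_SEEDS seeds, as in the paper's protocol."""
    return [run_single_seed(agent_factory(), mdp_factory(), seed) for seed in range(N_SEEDS)]
```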
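
The hyperparameter selection mentioned in the Dataset Splits and Experiment Setup rows (random search minimizing average regret across MDPs with randomly sampled parameters) follows a standard pattern; the sketch below assumes a caller-supplied `evaluate(config)` callable that returns the average cumulative regret of an agent run with `config`, and is not the implementation from the paper or the package.

```python
# Minimal sketch of hyperparameter selection by random search, assuming a
# hypothetical `evaluate(config)` callable supplied by the caller.
import random


def sample_hyperparameters(search_space, rng):
    """Draw one configuration, e.g. {'lr': (1e-4, 1e-1)} -> {'lr': 0.03}."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in search_space.items()}


def random_search(evaluate, search_space, n_trials=50, seed=0):
    """Return the configuration with the lowest average regret over n_trials draws."""
    rng = random.Random(seed)
    best_config, best_regret = None, float("inf")
    for _ in range(n_trials):
        config = sample_hyperparameters(search_space, rng)
        regret = evaluate(config)  # average regret across randomly sampled MDPs
        if regret < best_regret:
            best_config, best_regret = config, regret
    return best_config
```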