Near-Minimax-Optimal Distributional Reinforcement Learning with a Generative Model

Authors: Mark Rowland, Li Kevin Wenliang, Rémi Munos, Clare Lyle, Yunhao Tang, Will Dabney

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Finally, we provide an experimental study comparing a variety of model-based distributional RL algorithms, with several key takeaways for practitioners."
Researcher Affiliation | Industry | Mark Rowland (Google DeepMind); Li Kevin Wenliang (Google DeepMind); Rémi Munos (FAIR, Meta); Clare Lyle (Google DeepMind); Yunhao Tang (Google DeepMind); Will Dabney (Google DeepMind)
Pseudocode | Yes | "Algorithm 1: The direct categorical fixed-point algorithm (DCFP)." A hedged sketch of such a direct fixed-point solve is given after this table.
Open Source Code | No | The paper provides code snippets in Appendix G.5, marked with a copyright and license, but gives no link to an external repository containing the full methodology.
Open Datasets | No | The paper uses a generative model to obtain N i.i.d. sample transitions from each state; data are generated on the fly rather than drawn from a pre-existing, publicly available dataset.
Dataset Splits | No | The paper does not mention training/validation/test splits: it samples data from a generative model and evaluates on constructed environments.
Hardware Specification | No | "All experiments were run using the Python 3 language (Van Rossum and Drake, 2009)... As all experiments are tabular, each run uses a single CPU, and timings are reported within the experimental results." No specific CPU model or other hardware details are given.
Software Dependencies | No | "All experiments were run using the Python 3 language (Van Rossum and Drake, 2009), and made use of the NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), Matplotlib (Hunter, 2007), and Seaborn (Waskom, 2021) libraries." Publication years are given for the libraries, but explicit version numbers (e.g., NumPy 1.x or SciPy 1.x) are not.
Experiment Setup | Yes | "For each setting, we repeat the experiment 30 times with different sampled transitions. We display trade-off plots... for the cases of m ∈ {30, 100, 300, 1000} atoms and using N = 10^6 sample transitions from each state to estimate the transition matrix. We ran all DP methods with 30,000 iterations." A sketch of the sampling loop follows the DCFP sketch below.
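The DCFP algorithm referenced in the Pseudocode row computes the fixed point of the categorically projected distributional Bellman operator directly, rather than by iterating dynamic programming: the projected operator is affine in the atom probabilities, so the fixed point can be obtained by solving a single linear system. The following is a minimal NumPy sketch of that idea, not the paper's exact Algorithm 1. It assumes a state-dependent deterministic reward `r`, a policy-conditioned transition matrix `P`, an increasing atom grid `z` with at least two atoms, and the standard two-neighbour categorical projection; the function name `dcfp_sketch` and all variable names are ours.

```python
import numpy as np

def dcfp_sketch(P, r, gamma, z):
    """Sketch of a direct categorical fixed-point solve.

    P: (S, S) transition matrix under the evaluated policy.
    r: (S,) state-dependent rewards (an assumption of this sketch).
    z: (m,) strictly increasing categorical atom locations, m >= 2.
    Returns eta: (S, m) atom probabilities of the projected fixed point.
    """
    S, m = P.shape[0], z.shape[0]
    # Categorical projection weights: the mass at target value
    # g = r[x] + gamma * z[j] is split between the two neighbouring atoms.
    W = np.zeros((S, m, m))  # W[x, i, j]: weight onto atom i from source atom j
    for x in range(S):
        g = np.clip(r[x] + gamma * z, z[0], z[-1])
        idx = np.clip(np.searchsorted(z, g, side="right") - 1, 0, m - 2)
        frac = (g - z[idx]) / (z[idx + 1] - z[idx])
        for j in range(m):
            W[x, idx[j], j] += 1.0 - frac[j]
            W[x, idx[j] + 1, j] += frac[j]
    # Fixed-point equation: eta[x, i] = sum_{y, j} P[x, y] * W[x, i, j] * eta[y, j],
    # i.e. eta = M eta with M of shape (S*m, S*m).
    M = np.einsum("xy,xij->xiyj", P, W).reshape(S * m, S * m)
    A = np.eye(S * m) - M
    b = np.zeros(S * m)
    # The projection preserves probability mass, so one equation per state is
    # redundant; replace it with the normalisation sum_i eta[x, i] = 1. The
    # projected operator is a contraction, so the fixed point is unique.
    for x in range(S):
        A[x * m] = 0.0
        A[x * m, x * m:(x + 1) * m] = 1.0
        b[x * m] = 1.0
    return np.linalg.solve(A, b).reshape(S, m)
```

One linear solve of size Sm replaces many dynamic-programming sweeps, which is the trade-off the paper's experiments quantify against iterative methods.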
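For the generative-model rows above, the estimated transition matrix is built from N i.i.d. sampled successors per state, and the whole experiment is repeated over independently sampled models. Below is a hypothetical sketch under those assumptions; the callable `sample_next_state` and the loop structure are illustrative interfaces of ours, not the authors' code.

```python
import numpy as np

def estimate_transition_matrix(sample_next_state, S, N, rng):
    """Empirical transition matrix from N i.i.d. generative-model draws per state.

    sample_next_state(x, rng) -> int is an assumed interface: it draws one
    successor of state x from the (unknown) true transition kernel.
    """
    P_hat = np.empty((S, S))
    for x in range(S):
        draws = np.fromiter(
            (sample_next_state(x, rng) for _ in range(N)), dtype=np.int64, count=N
        )
        P_hat[x] = np.bincount(draws, minlength=S) / N
    return P_hat

# Illustrative repetition loop in the spirit of the reported setup:
# 30 repetitions, each with freshly sampled transitions (N = 10**6 per state).
# for seed in range(30):
#     rng = np.random.default_rng(seed)
#     P_hat = estimate_transition_matrix(sample_next_state, S, N=10**6, rng=rng)
#     eta_hat = dcfp_sketch(P_hat, r, gamma, z)  # plug into the solver above
```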