Foundations of Multivariate Distributional Reinforcement Learning

Authors: Harley Wiltzer, Jesse Farebrother, Arthur Gretton, Mark Rowland

NeurIPS 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Finally, with the aid of our technical results and simulations, we identify tradeoffs between distribution representations that influence the performance of multivariate distributional RL in practice." See also Section 6.1, Simulations: Distributional Successor Features. |
| Researcher Affiliation | Collaboration | Harley Wiltzer (Mila - Québec AI Institute; McGill University) harley.wiltzer@mail.mcgill.ca; Jesse Farebrother (Mila - Québec AI Institute; McGill University) jfarebro@cs.mcgill.ca; Arthur Gretton (Google DeepMind; Gatsby Unit, University College London) gretton@google.com; Mark Rowland (Google DeepMind) markrowland@google.com |
| Pseudocode | Yes | Algorithm 1: Projected Categorical Dynamic Programming. A hedged sketch of a projected categorical backup appears below the table. |
| Open Source Code | No | The NeurIPS Paper Checklist states "Code will be provided.", which is a promise of future availability, not a current release of the code for the work described in the paper. |
| Open Datasets | No | The paper describes using "100 random MDPs, with transitions drawn from Dirichlet priors and 2-dimensional cumulants drawn from uniform priors." This indicates custom-generated data rather than a specific, named, publicly available dataset with a concrete access link or formal citation. A sketch of such a generator appears below the table. |
| Dataset Splits | No | The paper does not explicitly describe training/validation/test splits, nor does it reference predefined splits or cross-validation setups for the MDP data used in the experiments. |
| Hardware Specification | Yes | TD-learning experiments were conducted on an NVIDIA A100 80GB GPU to parallelize experiments. |
| Software Dependencies | No | The paper mentions software such as JAX [BFH+18], JAXopt [BBC+21], and the Julia programming language [BEKS17], but it does not provide specific version numbers for these components (e.g., JAX 0.x or Julia 1.x). |
| Experiment Setup | Yes | SGD was used for optimization, with an annealed learning-rate schedule (λ_k)_{k ≥ 0} given by λ_k = k^{-3/5}, satisfying the conditions of Lemma 10. A sketch of this schedule appears below the table. |
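
For the Pseudocode row: the paper's Algorithm 1 (Projected Categorical Dynamic Programming) operates on multivariate return distributions and is not reproduced here. The following is a minimal scalar-return sketch of the standard projected categorical backup it generalizes; the function names, the evenly spaced support grid, and the restriction to a fixed-policy transition matrix `P` with reward vector `r` are our assumptions, not the paper's specification.

```python
import numpy as np

def categorical_projection(atoms, support, probs):
    """Project probability mass at locations `atoms` (weights `probs`)
    onto the fixed, evenly spaced grid `support`."""
    z_min, z_max = support[0], support[-1]
    dz = support[1] - support[0]
    b = (np.clip(atoms, z_min, z_max) - z_min) / dz  # fractional grid index
    lo = np.floor(b).astype(int)
    hi = np.ceil(b).astype(int)
    out = np.zeros_like(support)
    # Split each atom's mass between its two neighbouring grid points;
    # an atom sitting exactly on a grid point (lo == hi) keeps all its mass.
    np.add.at(out, lo, probs * np.where(lo == hi, 1.0, hi - b))
    np.add.at(out, hi, probs * (b - lo))
    return out

def projected_categorical_dp_step(P, r, gamma, support, dist):
    """One projected distributional Bellman backup under a fixed policy.

    P:    (S, S) state-to-state transition matrix
    r:    (S,)   per-state rewards
    dist: (S, m) categorical probabilities over `support`
    """
    new_dist = np.zeros_like(dist)
    for s in range(len(r)):
        shifted = r[s] + gamma * support  # Bellman-shifted support atoms
        # Mixture over successor states, each projected back onto the grid.
        for s2 in np.flatnonzero(P[s]):
            new_dist[s] += P[s, s2] * categorical_projection(shifted, support, dist[s2])
    return new_dist
```

Iterating `projected_categorical_dp_step` from any initial `dist` converges, since the projected categorical operator is known to be a contraction; in the multivariate setting the support becomes a grid in R^d and the projection changes accordingly.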
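For the Open Datasets row: a generator like the following would reproduce the described setup. The number of states, the Dirichlet concentration `alpha`, and the uniform range [0, 1] are not specified in the quoted text and are assumptions here.

```python
import numpy as np

def sample_random_mdp(n_states, cumulant_dim=2, alpha=1.0, seed=0):
    """Sample one random MDP of the kind described in the paper:
    transition rows drawn from a Dirichlet prior and per-state
    cumulants drawn from a uniform prior."""
    rng = np.random.default_rng(seed)
    # Each state's successor distribution is one Dirichlet(alpha, ..., alpha) draw.
    P = rng.dirichlet(alpha * np.ones(n_states), size=n_states)
    # 2-dimensional cumulants, i.i.d. uniform (range assumed, not stated).
    cumulants = rng.uniform(0.0, 1.0, size=(n_states, cumulant_dim))
    return P, cumulants

# e.g. 100 such MDPs, as in the quoted experiment description
# (the state count of 10 is an assumption):
mdps = [sample_random_mdp(n_states=10, seed=i) for i in range(100)]
```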
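For the Experiment Setup row: the schedule is straightforward to state in code. Reading the conditions of Lemma 10 as Robbins-Monro-type conditions (divergent sum, convergent squared sum) is our inference, since the lemma itself is not quoted above.

```python
def learning_rate(k: int) -> float:
    """Annealed schedule λ_k = k**(-3/5), for steps k = 1, 2, ...

    The exponent 3/5 lies in (1/2, 1], so the schedule satisfies the
    standard stochastic-approximation conditions: sum_k λ_k diverges
    while sum_k λ_k**2 converges.
    """
    return float(k) ** (-3.0 / 5.0)
```

In an SGD loop this is simply `theta -= learning_rate(k) * grad` at step k.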