Foundations of Multivariate Distributional Reinforcement Learning
Authors: Harley Wiltzer, Jesse Farebrother, Arthur Gretton, Mark Rowland
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Finally, with the aid of our technical results and simulations, we identify tradeoffs between distribution representations that influence the performance of multivariate distributional RL in practice." and Section 6.1, "Simulations: Distributional Successor Features" |
| Researcher Affiliation | Collaboration | Harley Wiltzer (Mila Québec AI Institute, McGill University; harley.wiltzer@mail.mcgill.ca); Jesse Farebrother (Mila Québec AI Institute, McGill University; jfarebro@cs.mcgill.ca); Arthur Gretton (Google DeepMind and Gatsby Unit, University College London; gretton@google.com); Mark Rowland (Google DeepMind; markrowland@google.com) |
| Pseudocode | Yes | Algorithm 1 Projected Categorical Dynamic Programming |
| Open Source Code | No | The NeurIPS Paper Checklist states 'Code will be provided.', which is a future promise, not a current release of the code for the work described in the paper. |
| Open Datasets | No | The paper describes using '100 random MDPs, with transitions drawn from Dirichlet priors and 2-dimensional cumulants drawn from uniform priors.' This indicates custom-generated data rather than a specific, named, publicly available dataset with a concrete access link or formal citation. (A minimal generation sketch appears after the table.) |
| Dataset Splits | No | The paper does not explicitly provide details about training/test/validation dataset splits, nor does it reference predefined splits or cross-validation setups for the MDP data used in experiments. |
| Hardware Specification | Yes | TD-learning experiments were conducted on an NVIDIA A100 80GB GPU to parallelize experiments. |
| Software Dependencies | No | The paper mentions software like 'JAX [BFH+18]', 'JAXopt [BBC+21]', and the 'Julia programming language [BEKS17]', but it does not provide specific version numbers for these software components (e.g., 'JAX 0.x' or 'Julia 1.x'). |
| Experiment Setup | Yes | SGD was used for optimization, using an annealed learning rate schedule (λ_k)_{k≥0} with λ_k = k^(-3/5), satisfying the conditions of Lemma 10. (See the schedule sketch below the table.) |
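
The Open Datasets row above describes experiments on 100 random MDPs with transitions drawn from Dirichlet priors and 2-dimensional cumulants drawn from uniform priors. The following Python sketch illustrates one plausible way to generate such data; the function name, state/action counts, and Dirichlet concentration `alpha` are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the random-MDP generation quoted in the Open Datasets row.
# Assumed names and sizes (sample_random_mdp, n_states, n_actions, alpha) are
# illustrative; the paper only states the priors and the cumulant dimension.
import numpy as np

def sample_random_mdp(n_states=10, n_actions=2, cumulant_dim=2, alpha=1.0, seed=0):
    rng = np.random.default_rng(seed)
    # Each (state, action) pair gets a categorical next-state distribution
    # sampled from a symmetric Dirichlet(alpha) prior.
    transitions = rng.dirichlet(alpha * np.ones(n_states), size=(n_states, n_actions))
    # 2-dimensional cumulants (vector-valued rewards) drawn from a uniform prior.
    cumulants = rng.uniform(0.0, 1.0, size=(n_states, cumulant_dim))
    return transitions, cumulants

# Example: generate the 100 random MDPs mentioned in the row above.
mdps = [sample_random_mdp(seed=k) for k in range(100)]
```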
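
The Experiment Setup row quotes an annealed SGD learning-rate schedule λ_k = k^(-3/5). The snippet below is a minimal sketch of that schedule and of a generic SGD step using it; the update wiring is an assumption for illustration, since the paper specifies only the schedule and that it satisfies the conditions of Lemma 10.

```python
# Minimal sketch of the annealed step-size schedule from the Experiment Setup row.
def learning_rate(k: int) -> float:
    """Annealed step size lambda_k = k^(-3/5), for steps k >= 1."""
    return float(k) ** (-3.0 / 5.0)

# Hypothetical SGD update using the schedule (theta: parameters, grad: gradient at step k):
# theta = theta - learning_rate(k) * grad
```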