Stochastically Dominant Distributional Reinforcement Learning
Authors: John Martin, Michal Lyskawinski, Xiaohu Li, Brendan Englot
ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments characterize the algorithm's performance and demonstrate how uncertainty and performance are better balanced using an SSD policy than with other risk measures. We validate our theoretical claims with several targeted experiments. The main hypothesis we test is that the SSD policy induces the least-disperse data distribution from which optimality can be achieved when learning off-policy. |
| Researcher Affiliation | Academia | John D. Martin 1 Michal Lyskawinski 1 Xiaohu Li 1 Brendan Englot 1 1Stevens Institute of Technology, Hoboken, New Jersey, USA. Correspondence to: John D. Martin <jmarti3@stevens.edu>. |
| Pseudocode | Yes | Algorithm 1 Online WGF Fitted Q-iteration; Algorithm 2 Proximal Loss |
| Open Source Code | No | The paper does not provide an explicit statement about releasing code or a link to a code repository. |
| Open Datasets | Yes | We revisit the Cliff Walk domain with a modified reward structure (See appendix). We used fixed Monte Carlo (MC) targets from the optimal greedy policy. We use off-policy updates with bootstrapped targets and compare performance results with an agent trained using the QR loss (Dabney et al., 2017) on three common control tasks from the Open AI Gym (Brockman et al., 2016): Mountain Car, Cart Pole, and Lunar Lander. |
| Dataset Splits | No | The paper does not explicitly provide specific percentages, counts, or a detailed methodology for creating train/validation/test dataset splits. It mentions using 'fixed Monte Carlo (MC) targets' and 'Open AI Gym' tasks, which implies predefined environments, but not explicit data splits for replication. |
| Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments, such as GPU/CPU models or cloud instance types. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments. |
| Experiment Setup | No | We parameterize return distributions with a two-layer fully-connected neural network of 256 hidden units. We use off-policy updates with bootstrapped targets and compare performance results with an agent trained using the QR loss (Dabney et al., 2017) on three common control tasks from the Open AI Gym (Brockman et al., 2016): Mountain Car, Cart Pole, and Lunar Lander. This describes the model architecture and general update strategy, but it does not include specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training schedules. |
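
The Experiment Setup row names only the architecture ("a two-layer fully-connected neural network of 256 hidden units") and the Gym control tasks, with no hyperparameters. The sketch below is one illustrative way to instantiate that description for a listed task; it is not the authors' implementation. The choice of PyTorch, the particle count per action, the ReLU activation, the reading of "two-layer" as two hidden layers of 256 units, and the `CartPole-v1` environment version are all assumptions made for illustration.

```python
# Illustrative sketch (assumptions noted above), not the authors' code:
# a return-distribution network matching the paper's stated architecture,
# applied to one of the three named Gym control tasks.
import gym
import torch
import torch.nn as nn


class ReturnDistributionNet(nn.Module):
    """Maps a state to a set of return-distribution particles per action."""

    def __init__(self, obs_dim: int, n_actions: int, n_particles: int = 32):
        super().__init__()
        self.n_actions = n_actions
        self.n_particles = n_particles  # particle count is an assumption
        # Two fully-connected hidden layers of 256 units (one plausible
        # reading of the paper's "two-layer ... 256 hidden units").
        self.body = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        self.head = nn.Linear(256, n_actions * n_particles)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Returns a (batch, n_actions, n_particles) tensor of return samples.
        out = self.head(self.body(obs))
        return out.view(-1, self.n_actions, self.n_particles)


if __name__ == "__main__":
    # One of the three control tasks named in the paper; the version tag
    # is an assumption.
    env = gym.make("CartPole-v1")
    reset_out = env.reset()
    # Handle both the classic Gym API (obs) and the newer API (obs, info).
    obs = reset_out[0] if isinstance(reset_out, tuple) else reset_out

    net = ReturnDistributionNet(env.observation_space.shape[0],
                                env.action_space.n)
    particles = net(torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0))
    print(particles.shape)  # e.g. torch.Size([1, 2, 32])
```

A reproducer would still need to choose the loss (the paper's WGF/proximal updates or the QR loss baseline), learning rate, batch size, and training schedule, none of which are reported in the assessed text.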