Stochastically Dominant Distributional Reinforcement Learning

Authors: John Martin, Michal Lyskawinski, Xiaohu Li, Brendan Englot

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments characterize the algorithm's performance and demonstrate how uncertainty and performance are better balanced using an SSD policy than with other risk measures. We validate our theoretical claims with several targeted experiments. The main hypothesis we test is that the SSD policy induces the least-disperse data distribution from which optimality can be achieved when learning off-policy. (A standard definition of second-order stochastic dominance is sketched after this table.)
Researcher Affiliation | Academia | John D. Martin, Michal Lyskawinski, Xiaohu Li, Brendan Englot; Stevens Institute of Technology, Hoboken, New Jersey, USA. Correspondence to: John D. Martin <jmarti3@stevens.edu>.
Pseudocode | Yes | Algorithm 1: Online WGF Fitted Q-iteration; Algorithm 2: Proximal Loss. (A generic fitted Q-iteration skeleton, not the authors' WGF variant, is sketched after this table.)
Open Source Code | No | The paper does not provide an explicit statement about releasing code or a link to a code repository.
Open Datasets | Yes | We revisit the Cliff Walk domain with a modified reward structure (See appendix). We used fixed Monte Carlo (MC) targets from the optimal greedy policy. We use off-policy updates with bootstrapped targets and compare performance results with an agent trained using the QR loss (Dabney et al., 2017) on three common control tasks from the OpenAI Gym (Brockman et al., 2016): Mountain Car, Cart Pole, and Lunar Lander. (An environment-setup sketch for these three tasks follows the table.)
Dataset Splits | No | The paper does not provide explicit percentages, counts, or a methodology for creating train/validation/test dataset splits. It mentions using 'fixed Monte Carlo (MC) targets' and 'OpenAI Gym' tasks, which implies predefined environments, but not explicit data splits for replication.
Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments, such as GPU/CPU models or cloud instance types.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | No | We parameterize return distributions with a two-layer fully-connected neural network of 256 hidden units. We use off-policy updates with bootstrapped targets and compare performance results with an agent trained using the QR loss (Dabney et al., 2017) on three common control tasks from the OpenAI Gym (Brockman et al., 2016): Mountain Car, Cart Pole, and Lunar Lander. This describes the model architecture and general update strategy, but it does not include specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training schedules. (An architecture sketch is given after this table.)
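
For context on the SSD criterion quoted in the Research Type row, a standard textbook statement of second-order stochastic dominance (not quoted from the paper) for two return distributions with CDFs F_X and F_Y is:

```latex
% X dominates Y in the second order, X \succeq_{(2)} Y, iff
\int_{-\infty}^{t} F_X(s)\,ds \;\le\; \int_{-\infty}^{t} F_Y(s)\,ds
\qquad \text{for all } t \in \mathbb{R},
% equivalently,
\mathbb{E}[u(X)] \;\ge\; \mathbb{E}[u(Y)]
\quad \text{for every nondecreasing concave utility } u.
```

SSD-dominant returns are preferred by every risk-averse decision maker, which is the sense in which an SSD policy can balance uncertainty against expected performance.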
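
The paper's Algorithm 1 (Online WGF Fitted Q-iteration) and Algorithm 2 (Proximal Loss) are not reproduced here. As a point of reference only, a minimal online fitted Q-iteration loop with bootstrapped off-policy targets is sketched below; it uses a plain squared-error regression where the paper uses Wasserstein-gradient-flow particle updates, and the environment choice, network width, learning rate, and exploration rate are illustrative assumptions (classic pre-0.26 Gym API and PyTorch assumed).

```python
# Generic online fitted Q-iteration skeleton; NOT the paper's WGF algorithm.
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")                      # any of the paper's tasks works here
obs_dim, n_actions = env.observation_space.shape[0], env.action_space.n
q_net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, epsilon = 0.99, 0.1

obs = env.reset()
for step in range(1000):
    # Epsilon-greedy behaviour policy (off-policy data collection).
    with torch.no_grad():
        q = q_net(torch.as_tensor(obs, dtype=torch.float32))
    action = env.action_space.sample() if torch.rand(()) < epsilon else int(q.argmax())
    next_obs, reward, done, _ = env.step(action)

    # Bootstrapped target: r + gamma * max_a' Q(s', a'); the paper instead
    # bootstraps whole return distributions and fits them with a proximal loss.
    with torch.no_grad():
        target = reward + gamma * (0.0 if done else q_net(
            torch.as_tensor(next_obs, dtype=torch.float32)).max().item())
    pred = q_net(torch.as_tensor(obs, dtype=torch.float32))[action]
    loss = (pred - torch.tensor(target, dtype=torch.float32)) ** 2

    opt.zero_grad()
    loss.backward()
    opt.step()
    obs = env.reset() if done else next_obs
```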
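
The three control tasks named in the Open Datasets row are available through OpenAI Gym. A minimal setup sketch follows; the version suffixes in the environment IDs are standard Gym names and are an assumption, since the paper does not state them, and Lunar Lander additionally requires the Box2D extra.

```python
import gym  # OpenAI Gym (Brockman et al., 2016); classic pre-0.26 API assumed

# Standard Gym IDs for the paper's three control tasks (version tags assumed).
ENV_IDS = ["MountainCar-v0", "CartPole-v1", "LunarLander-v2"]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    obs = env.reset()
    print(env_id, "obs dim:", env.observation_space.shape[0],
          "actions:", env.action_space.n)
    env.close()
```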
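
The Experiment Setup row quotes a two-layer fully-connected network of 256 hidden units that parameterizes return distributions. One plausible reading of that architecture is sketched below; PyTorch, the particle (quantile) output head, the particle count, and the interpretation "two hidden layers of 256 units each" are all assumptions, since the quoted text does not pin these down.

```python
import torch
import torch.nn as nn

class ReturnDistributionNet(nn.Module):
    """Fully-connected network whose outputs parameterize per-action return
    distributions as a set of particle (quantile) locations. Layer layout,
    framework, and particle count are illustrative assumptions."""

    def __init__(self, obs_dim: int, n_actions: int, n_particles: int = 32):
        super().__init__()
        self.n_actions, self.n_particles = n_actions, n_particles
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),   # hidden layer 1
            nn.Linear(256, 256), nn.ReLU(),       # hidden layer 2
            nn.Linear(256, n_actions * n_particles),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # (batch, n_actions, n_particles) particle locations per action.
        return self.net(obs).view(-1, self.n_actions, self.n_particles)

# Example: Cart Pole has a 4-dimensional observation space and 2 actions.
net = ReturnDistributionNet(obs_dim=4, n_actions=2)
print(net(torch.zeros(1, 4)).shape)  # torch.Size([1, 2, 32])
```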