Stochastically Dominant Distributional Reinforcement Learning

Authors: John Martin, Michal Lyskawinski, Xiaohu Li, Brendan Englot

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments characterize the algorithm's performance and demonstrate how uncertainty and performance are better balanced using an SSD policy than with other risk measures. We validate our theoretical claims with several targeted experiments. The main hypothesis we test is that the SSD policy induces the least-disperse data distribution from which optimality can be achieved when learning off-policy. (A standard definition of second-order stochastic dominance is sketched after this table.)
Researcher Affiliation | Academia | John D. Martin, Michal Lyskawinski, Xiaohu Li, Brendan Englot; Stevens Institute of Technology, Hoboken, New Jersey, USA. Correspondence to: John D. Martin <jmarti3@stevens.edu>.
Pseudocode | Yes | Algorithm 1: Online WGF Fitted Q-iteration; Algorithm 2: Proximal Loss. (A generic fitted Q-iteration skeleton, not the authors' WGF variant, is sketched after this table.)
Open Source Code | No | The paper does not provide an explicit statement about releasing code or a link to a code repository.
Open Datasets | Yes | We revisit the Cliff Walk domain with a modified reward structure (See appendix). We used fixed Monte Carlo (MC) targets from the optimal greedy policy. We use off-policy updates with bootstrapped targets and compare performance results with an agent trained using the QR loss (Dabney et al., 2017) on three common control tasks from the OpenAI Gym (Brockman et al., 2016): Mountain Car, Cart Pole, and Lunar Lander. (An environment-setup sketch for these three tasks follows the table.)
Dataset Splits | No | The paper does not provide explicit percentages, counts, or a methodology for creating train/validation/test dataset splits. It mentions using 'fixed Monte Carlo (MC) targets' and 'OpenAI Gym' tasks, which implies predefined environments, but not explicit data splits for replication.
Hardware Specification | No | The paper does not specify any particular hardware used for running the experiments, such as GPU/CPU models or cloud instance types.
Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | No | We parameterize return distributions with a two-layer fully-connected neural network of 256 hidden units. We use off-policy updates with bootstrapped targets and compare performance results with an agent trained using the QR loss (Dabney et al., 2017) on three common control tasks from the OpenAI Gym (Brockman et al., 2016): Mountain Car, Cart Pole, and Lunar Lander. This describes the model architecture and general update strategy, but it does not include specific hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed training schedules. (An architecture sketch is given after this table.)
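
For context on the SSD criterion quoted in the Research Type row, a standard textbook statement of second-order stochastic dominance (not quoted from the paper) for two return distributions with CDFs F_X and F_Y is:

```latex
% X dominates Y in the second order, X \succeq_{(2)} Y, iff
\int_{-\infty}^{t} F_X(s)\,ds \;\le\; \int_{-\infty}^{t} F_Y(s)\,ds
\qquad \text{for all } t \in \mathbb{R},
% equivalently,
\mathbb{E}[u(X)] \;\ge\; \mathbb{E}[u(Y)]
\quad \text{for every nondecreasing concave utility } u.
```

SSD-dominant returns are preferred by every risk-averse decision maker, which is the sense in which an SSD policy can balance uncertainty against expected performance.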
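
The paper's Algorithm 1 (Online WGF Fitted Q-iteration) and Algorithm 2 (Proximal Loss) are not reproduced here. As a point of reference only, a minimal online fitted Q-iteration loop with bootstrapped off-policy targets is sketched below; it uses a plain squared-error regression where the paper uses Wasserstein-gradient-flow particle updates, and the environment choice, network width, learning rate, and exploration rate are illustrative assumptions (classic pre-0.26 Gym API and PyTorch assumed).

```python
# Generic online fitted Q-iteration skeleton; NOT the paper's WGF algorithm.
import gym
import torch
import torch.nn as nn

env = gym.make("CartPole-v1")                      # any of the paper's tasks works here
obs_dim, n_actions = env.observation_space.shape[0], env.action_space.n
q_net = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, n_actions))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, epsilon = 0.99, 0.1

obs = env.reset()
for step in range(1000):
    # Epsilon-greedy behaviour policy (off-policy data collection).
    with torch.no_grad():
        q = q_net(torch.as_tensor(obs, dtype=torch.float32))
    action = env.action_space.sample() if torch.rand(()) < epsilon else int(q.argmax())
    next_obs, reward, done, _ = env.step(action)

    # Bootstrapped target: r + gamma * max_a' Q(s', a'); the paper instead
    # bootstraps whole return distributions and fits them with a proximal loss.
    with torch.no_grad():
        target = reward + gamma * (0.0 if done else q_net(
            torch.as_tensor(next_obs, dtype=torch.float32)).max().item())
    pred = q_net(torch.as_tensor(obs, dtype=torch.float32))[action]
    loss = (pred - torch.tensor(target, dtype=torch.float32)) ** 2

    opt.zero_grad()
    loss.backward()
    opt.step()
    obs = env.reset() if done else next_obs
```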
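
The three control tasks named in the Open Datasets row are available through OpenAI Gym. A minimal setup sketch follows; the version suffixes in the environment IDs are standard Gym names and are an assumption, since the paper does not state them, and Lunar Lander additionally requires the Box2D extra.

```python
import gym  # OpenAI Gym (Brockman et al., 2016); classic pre-0.26 API assumed

# Standard Gym IDs for the paper's three control tasks (version tags assumed).
ENV_IDS = ["MountainCar-v0", "CartPole-v1", "LunarLander-v2"]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    obs = env.reset()
    print(env_id, "obs dim:", env.observation_space.shape[0],
          "actions:", env.action_space.n)
    env.close()
```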
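
The Experiment Setup row quotes a two-layer fully-connected network of 256 hidden units that parameterizes return distributions. One plausible reading of that architecture is sketched below; PyTorch, the particle (quantile) output head, the particle count, and the interpretation "two hidden layers of 256 units each" are all assumptions, since the quoted text does not pin these down.

```python
import torch
import torch.nn as nn

class ReturnDistributionNet(nn.Module):
    """Fully-connected network whose outputs parameterize per-action return
    distributions as a set of particle (quantile) locations. Layer layout,
    framework, and particle count are illustrative assumptions."""

    def __init__(self, obs_dim: int, n_actions: int, n_particles: int = 32):
        super().__init__()
        self.n_actions, self.n_particles = n_actions, n_particles
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),   # hidden layer 1
            nn.Linear(256, 256), nn.ReLU(),       # hidden layer 2
            nn.Linear(256, n_actions * n_particles),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # (batch, n_actions, n_particles) particle locations per action.
        return self.net(obs).view(-1, self.n_actions, self.n_particles)

# Example: Cart Pole has a 4-dimensional observation space and 2 actions.
net = ReturnDistributionNet(obs_dim=4, n_actions=2)
print(net(torch.zeros(1, 4)).shape)  # torch.Size([1, 2, 32])
```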