Implicit Distributional Reinforcement Learning

Authors: Yuguang Yue, Zhendong Wang, Mingyuan Zhou

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments serve to answer the following questions: (a) How does IDAC perform when compared to state-of-the-art baselines, including SAC (Haarnoja et al., 2018), TD3 (Fujimoto et al., 2018), and PPO (Schulman et al., 2017)? (b) Can a semi-implicit policy capture complex distributional properties such as skewness, multi-modality, and covariance structure? (c) How well does the distributional matching work when minimizing the quantile regression Huber loss? (d) How important is the type of policy distribution, such as a semi-implicit policy, a diagonal Gaussian policy, or a deterministic policy, under this framework? (e) How much improvement does using distributional critics bring? (f) How critical is the twin-delayed network? (g) Will other baselines (such as SAC) benefit from using multiple actions (J > 1) for policy gradient estimation? We will show two sets of experiments, one for an evaluation study and the other for an ablation study, to answer the aforementioned questions.
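Question (c) refers to the quantile regression Huber loss used to fit the distributional critic. For reference, below is a minimal PyTorch sketch of that loss in the form standard in the distributional RL literature (e.g., QR-DQN); the function name, tensor shapes, and the default κ = 1 are illustrative assumptions rather than the authors' implementation.

```python
import torch

def quantile_huber_loss(pred_quantiles, target_quantiles, taus, kappa=1.0):
    """pred_quantiles: (batch, K); target_quantiles: (batch, K'); taus: (K,)."""
    # Pairwise TD errors u = target - prediction, shape (batch, K', K).
    u = target_quantiles.unsqueeze(-1) - pred_quantiles.unsqueeze(1)
    # Elementwise Huber loss L_kappa(u): quadratic near zero, linear beyond kappa.
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))
    # Asymmetric quantile weight |tau - 1{u < 0}|, which makes the regression
    # recover quantiles rather than the mean.
    weight = (taus.view(1, 1, -1) - (u.detach() < 0).float()).abs()
    # Sum over predicted quantiles, average over target samples and batch.
    return (weight * huber / kappa).sum(dim=-1).mean()
```

The Huber term keeps gradients bounded for large TD errors, which is why it is commonly preferred over the plain pinball loss when training deep distributional critics.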
Researcher Affiliation | Academia | Yuguang Yue, Zhendong Wang, and Mingyuan Zhou, The University of Texas at Austin, Austin, TX 78712
Pseudocode | Yes | We provide an overview of the algorithm here and defer pseudocode with all implementation details to Appendix B.
Open Source Code | Yes | Python code is provided (https://github.com/zhougroup/IDAC).
Open Datasets | Yes | We conduct empirical comparisons on the benchmark tasks provided by OpenAI Gym (Brockman et al., 2016) and MuJoCo simulators (Todorov et al., 2012).
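For context, this is a minimal sketch of loading one of these benchmark tasks through the classic Gym API of that era; it assumes the MuJoCo binaries are installed, and `HalfCheetah-v2` is an example environment id rather than a claim about the paper's exact task list.

```python
import gym

env = gym.make("HalfCheetah-v2")  # MuJoCo locomotion task via the Gym registry
obs = env.reset()
for _ in range(1000):
    action = env.action_space.sample()           # random-policy placeholder
    obs, reward, done, info = env.step(action)   # classic 4-tuple Gym API
    if done:
        obs = env.reset()
env.close()
```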
Dataset Splits | No | The paper describes evaluation procedures for continuous control environments but does not specify explicit train/validation/test splits of a static dataset.
Hardware Specification | Yes | The authors acknowledge the support of ... NVIDIA Corporation with the donation of the Titan Xp GPU used for this research, and the Texas Advanced Computing Center (TACC) at The University of Texas at Austin for providing HPC resources...
Software Dependencies | No | The paper mentions that the algorithms are 'built upon the stable baselines codebase (https://github.com/hill-a/stable-baselines) of Hill et al. (2018)', but it does not provide specific version numbers for software dependencies such as Python, deep learning frameworks, or libraries.
Experiment Setup | Yes | IDAC is implemented with a uniform set of hyperparameters to guarantee fair comparisons. Specifically, we use three separate fully-connected multilayer perceptrons (MLPs), which all have two 256-unit hidden layers and ReLU nonlinearities... We fix for all experiments the number of noise samples ξ^(ℓ) at L = 21. We set the number of equally spaced quantiles (the same as the number of ϵ^(k)) to K = 51 and the number of auxiliary actions to J = 51 by default. A more detailed parameter setting can be found in Appendix C.
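To make the quoted sizing concrete, the sketch below wires up a two-hidden-layer, 256-unit ReLU MLP together with the stated constants L = 21, K = 51, and J = 51; it is a PyTorch illustration under assumed input dimensions, not the authors' stable-baselines-based implementation.

```python
import torch.nn as nn

L_NOISE = 21      # number of noise samples xi^(l)
K_QUANTILES = 51  # number of equally spaced quantiles / epsilon^(k) samples
J_ACTIONS = 51    # number of auxiliary actions for policy gradient estimation

def mlp(in_dim, out_dim, hidden=256):
    """Two 256-unit hidden layers with ReLU nonlinearities, as quoted above."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

# Example: a distributional critic mapping (state, action, epsilon) to a
# single quantile value; the state/action dimensions here are illustrative.
state_dim, action_dim = 17, 6
critic = mlp(state_dim + action_dim + 1, 1)
```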