Learning Robust Options by Conditional Value at Risk Optimization
Authors: Takuya Hiraoka, Takahisa Imagawa, Tatsuya Mori, Takashi Onishi, Yoshimasa Tsuruoka
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments to evaluate our method in multi-joint robot control tasks (Hopper Ice Block, Half-Cheetah, and Walker2D). Experimental results show that our method produces options that 1) give better worst-case performance than the options learned only to minimize the average-case loss, and 2) give better average-case performance than the options learned only to minimize the worst-case loss. |
| Researcher Affiliation | Collaboration | Takuya Hiraoka (1,2,3), Takahisa Imagawa (2), Tatsuya Mori (1,2,3), Takashi Onishi (1,2), Yoshimasa Tsuruoka (2,4); 1: NEC Corporation, 2: National Institute of Advanced Industrial Science and Technology, 3: RIKEN Center for Advanced Intelligence Project, 4: The University of Tokyo |
| Pseudocode | Yes | Algorithm 1 shows a pseudocode for learning options with the CVaR constraint. (See the CVaR sketch after the table.) |
| Open Source Code | Yes | Source code to replicate the experiments is available at https://github.com/TakuyaHiraoka/Learning-Robust-Options-by-Conditional-Value-at-Risk-Optimization |
| Open Datasets | Yes | The experiments are conducted in the robust MDP extension of the following environments: Half-Cheetah: ... [33]. Walker2D: ... [5]. Hopper Ice Block: ... [13, 16]. ... For the model parameter distribution, we prepare two types of distribution: continuous and discrete. For the continuous distribution, as in Rajeswaran et al. [24], we use a truncated Gaussian distribution, which follows the hyperparameters described in Table 1 in Appendices. For the discrete distribution, we use a Bernoulli distribution, which follows hyperparameters described in Table 2 in Appendices. (See the parameter-sampling sketch after the table.) |
| Dataset Splits | No | The paper discusses training and evaluating models within reinforcement learning environments but does not specify explicit training, validation, and testing dataset splits (e.g., percentages or counts) as typically defined for static datasets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using and adapting PPOC [16] and Proximal Policy Optimization [27] but does not specify version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | For all of the aforementioned methods, we set the hyper-parameters (e.g., policy and value network architecture and learning rate) for PPOC to the same values as in the original paper [16]. The parameters of the policy network and the value network are updated when the total number of trajectories reaches 10240. This parameter update is repeated 977 times for each learning trial. (See the schedule sketch after the table.) |
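
The Pseudocode row above refers to learning options under a CVaR constraint. As a reference point for what CVaR measures, the following is a minimal sketch of an empirical CVaR_alpha over sampled losses, i.e., the mean loss over the worst alpha-fraction of samples; the function name and the choice alpha = 0.1 are illustrative assumptions, not values taken from the paper or its code.

```python
import numpy as np

def empirical_cvar(losses, alpha=0.1):
    """Empirical CVaR_alpha: mean loss over the worst alpha-fraction of samples.

    `losses` holds per-episode losses (e.g., negative returns gathered under
    sampled model parameters); alpha=0.1 is an assumed illustrative value.
    """
    losses = np.asarray(losses, dtype=np.float64)
    # Value at Risk (VaR): the (1 - alpha)-quantile of the loss distribution.
    var = np.quantile(losses, 1.0 - alpha)
    # CVaR: average of the losses at or beyond the VaR threshold.
    return losses[losses >= var].mean()
```

With 1,000 sampled losses and alpha = 0.1, this averages roughly the 100 largest losses; a CVaR constraint bounds this tail average rather than the mean loss.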
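
The Open Datasets row notes that model parameters are drawn from a truncated Gaussian (continuous case) or a Bernoulli (discrete case). The sketch below illustrates truncated-Gaussian sampling with SciPy; the parameter names, means, and truncation bounds are assumptions for illustration, and the actual hyperparameters are those listed in Tables 1 and 2 of the paper's appendices.

```python
from scipy.stats import truncnorm

# Illustrative (assumed) mean, std dev, and truncation bounds for two
# MuJoCo model parameters; the real values are in Tables 1-2 of the appendices.
PARAM_SPECS = {
    "torso_mass": (6.0, 1.5, 3.0, 9.0),
    "ground_friction": (2.0, 0.25, 1.5, 2.5),
}

def sample_model_params():
    """Draw one perturbed model-parameter set from truncated Gaussians."""
    params = {}
    for name, (mean, std, low, high) in PARAM_SPECS.items():
        a, b = (low - mean) / std, (high - mean) / std  # standardized bounds
        params[name] = truncnorm.rvs(a, b, loc=mean, scale=std)
    return params
```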
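
For scale, the Experiment Setup row implies roughly 977 × 10240 ≈ 1.0 × 10^7 collected samples per learning trial. The loop below sketches that schedule; `collect_trajectories` and `ppoc_update` are hypothetical placeholders, not functions from the authors' repository.

```python
SAMPLES_PER_UPDATE = 10240  # batch size quoted in the Experiment Setup row
NUM_UPDATES = 977           # parameter updates per learning trial

def run_trial(collect_trajectories, ppoc_update):
    """Alternate data collection and PPOC updates on the quoted schedule.

    `collect_trajectories(n)` and `ppoc_update(batch)` are hypothetical
    stand-ins for the rollout and update routines.
    """
    for _ in range(NUM_UPDATES):
        batch = collect_trajectories(SAMPLES_PER_UPDATE)
        ppoc_update(batch)
    # Total samples per trial: 977 * 10240 = 10,004,480
```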