Learning Robust Options by Conditional Value at Risk Optimization
Authors: Takuya Hiraoka, Takahisa Imagawa, Tatsuya Mori, Takashi Onishi, Yoshimasa Tsuruoka
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments to evaluate our method in multi-joint robot control tasks (Hopper Ice Block, Half-Cheetah, and Walker2D). Experimental results show that our method produces options that 1) give better worst-case performance than the options learned only to minimize the average-case loss, and 2) give better average-case performance than the options learned only to minimize the worst-case loss. |
| Researcher Affiliation | Collaboration | Takuya Hiraoka (1,2,3), Takahisa Imagawa (2), Tatsuya Mori (1,2,3), Takashi Onishi (1,2), Yoshimasa Tsuruoka (2,4); 1: NEC Corporation, 2: National Institute of Advanced Industrial Science and Technology, 3: RIKEN Center for Advanced Intelligence Project, 4: The University of Tokyo |
| Pseudocode | Yes | Algorithm 1 shows a pseudocode for learning options with the CVaR constraint. (See the CVaR sketch after the table.) |
| Open Source Code | Yes | Source code to replicate the experiments is available at https://github.com/TakuyaHiraoka/Learning-Robust-Options-by-Conditional-Value-at-Risk-Optimization |
| Open Datasets | Yes | The experiments are conducted in the robust MDP extension of the following environments: Half-Cheetah: ... [33]. Walker2D: ... [5]. Hopper Ice Block: ... [13, 16]. ... For the model parameter distribution, we prepare two types of distribution: continuous and discrete. For the continuous distribution, as in Rajeswaran et al. [24], we use a truncated Gaussian distribution, which follows the hyperparameters described in Table 1 in Appendices. For the discrete distribution, we use a Bernoulli distribution, which follows hyperparameters described in Table 2 in Appendices. (See the parameter-sampling sketch after the table.) |
| Dataset Splits | No | The paper discusses training and evaluating models within reinforcement learning environments but does not specify explicit training, validation, and testing dataset splits (e.g., percentages or counts) as typically defined for static datasets. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using and adapting PPOC [16] and Proximal Policy Optimization [27] but does not specify version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | For all of the aforementioned methods, we set the hyper-parameters (e.g., policy and value network architecture and learning rate) for PPOC to the same values as in the original paper [16]. The parameters of the policy network and the value network are updated when the total number of trajectories reaches 10240. This parameter update is repeated 977 times for each learning trial. (See the schedule sketch after the table.) |
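
The Pseudocode row above refers to learning options under a CVaR constraint. As a reference point for what CVaR measures, the following is a minimal sketch of an empirical CVaR_alpha over sampled losses, i.e., the mean loss over the worst alpha-fraction of samples; the function name and the choice alpha = 0.1 are illustrative assumptions, not values taken from the paper or its code.

```python
import numpy as np

def empirical_cvar(losses, alpha=0.1):
    """Empirical CVaR_alpha: mean loss over the worst alpha-fraction of samples.

    `losses` holds per-episode losses (e.g., negative returns gathered under
    sampled model parameters); alpha=0.1 is an assumed illustrative value.
    """
    losses = np.asarray(losses, dtype=np.float64)
    # Value at Risk (VaR): the (1 - alpha)-quantile of the loss distribution.
    var = np.quantile(losses, 1.0 - alpha)
    # CVaR: average of the losses at or beyond the VaR threshold.
    return losses[losses >= var].mean()
```

With 1,000 sampled losses and alpha = 0.1, this averages roughly the 100 largest losses; a CVaR constraint bounds this tail average rather than the mean loss.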
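
The Open Datasets row notes that model parameters are drawn from a truncated Gaussian (continuous case) or a Bernoulli (discrete case). The sketch below illustrates truncated-Gaussian sampling with SciPy; the parameter names, means, and truncation bounds are assumptions for illustration, and the actual hyperparameters are those listed in Tables 1 and 2 of the paper's appendices.

```python
from scipy.stats import truncnorm

# Illustrative (assumed) mean, std dev, and truncation bounds for two
# MuJoCo model parameters; the real values are in Tables 1-2 of the appendices.
PARAM_SPECS = {
    "torso_mass": (6.0, 1.5, 3.0, 9.0),
    "ground_friction": (2.0, 0.25, 1.5, 2.5),
}

def sample_model_params():
    """Draw one perturbed model-parameter set from truncated Gaussians."""
    params = {}
    for name, (mean, std, low, high) in PARAM_SPECS.items():
        a, b = (low - mean) / std, (high - mean) / std  # standardized bounds
        params[name] = truncnorm.rvs(a, b, loc=mean, scale=std)
    return params
```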
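
For scale, the Experiment Setup row implies roughly 977 × 10240 ≈ 1.0 × 10^7 collected samples per learning trial. The loop below sketches that schedule; `collect_trajectories` and `ppoc_update` are hypothetical placeholders, not functions from the authors' repository.

```python
SAMPLES_PER_UPDATE = 10240  # batch size quoted in the Experiment Setup row
NUM_UPDATES = 977           # parameter updates per learning trial

def run_trial(collect_trajectories, ppoc_update):
    """Alternate data collection and PPOC updates on the quoted schedule.

    `collect_trajectories(n)` and `ppoc_update(batch)` are hypothetical
    stand-ins for the rollout and update routines.
    """
    for _ in range(NUM_UPDATES):
        batch = collect_trajectories(SAMPLES_PER_UPDATE)
        ppoc_update(batch)
    # Total samples per trial: 977 * 10240 = 10,004,480
```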