Relative Policy-Transition Optimization for Fast Policy Transfer
Authors: Jiawei Xu, Cheng Zhou, Yizheng Zhang, Baoxiang Wang, Lei Han
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of RPTO on a set of MuJoCo continuous control tasks by creating policy transfer problems via variant dynamics. In this section, we perform comprehensive experiments across diverse control tasks to assess the efficacy of our RPTO. Additionally, we analyze how RPT, RTO, and RPTO individually contribute to learning. (See the environment sketch after this table.) |
| Researcher Affiliation | Collaboration | Jiawei Xu1,2*, Cheng Zhou1, Yizheng Zhang1, Baoxiang Wang2, Lei Han1*; 1 Tencent Robotics X; 2 The Chinese University of Hong Kong, Shenzhen; jiaweixu1@link.cuhk.edu.cn, {mikechzhou,yizhenzhang}@tencent.com, bxiangwang@cuhk.edu.cn, lxhan@tencent.com |
| Pseudocode | Yes | Algorithm 1: Relative Policy-Transition Optimization. |
| Open Source Code | No | The paper does not provide concrete access to its own source code (specific repository link, explicit code release statement, or code in supplementary materials) for the methodology described. |
| Open Datasets | Yes | We experiment on a set of MuJoCo continuous control tasks with the standard neural network (NN) based probabilistic dynamics model, including Ant, Hopper, HalfCheetah, Walker2d and Swimmer. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions software environments like MuJoCo and OpenAI Gym but does not provide specific version numbers for these or any other ancillary software dependencies, such as programming languages or deep learning frameworks. |
| Experiment Setup | Yes | We set the ensemble size to 7 and each ensemble network has 4 fully connected layers with 400 units. Each head of the dynamics model is a probabilistic neural network which outputs a Gaussian distribution with diagonal covariance: $p^i_\phi(s_{t+1}, r_t \mid s_t, a_t) = \mathcal{N}\big(\mu^i_\phi(s_t, a_t), \Sigma^i_\phi(s_t, a_t)\big)$. We set the model horizon to 1 and the replay ratio of dynamics to 1 for all environments. Other implementation details are provided in Appendix G. (See the model sketch after this table.) |
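
Since the paper does not release code, the following is a minimal sketch of how a variant-dynamics policy transfer problem could be set up in OpenAI Gym. The helper `make_variant_env` and its `mass_scale` parameter are our assumptions, not taken from the paper; rescaling MuJoCo body masses is simply one common way to obtain a target environment whose dynamics differ from the source.

```python
# Hypothetical sketch: build a target environment whose dynamics differ from the
# source by rescaling the MuJoCo body masses. `make_variant_env` and `mass_scale`
# are illustrative names, not taken from the paper.
import gym
import numpy as np

def make_variant_env(env_id="HalfCheetah-v2", mass_scale=1.5):
    """Return a MuJoCo env whose body masses are scaled by `mass_scale`."""
    env = gym.make(env_id)
    model = env.unwrapped.model                  # underlying mujoco_py model
    model.body_mass[:] = np.asarray(model.body_mass) * mass_scale
    return env

source_env = gym.make("HalfCheetah-v2")          # original dynamics
target_env = make_variant_env(mass_scale=1.5)    # variant dynamics for transfer
```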
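
The Experiment Setup row describes the dynamics model precisely enough to sketch it. Below is a minimal PyTorch reconstruction, again our assumption rather than the authors' code: an ensemble of 7 probabilistic networks, each with 4 fully connected layers of 400 units, where each head outputs the mean and diagonal covariance (as a log-variance) of a Gaussian over the next state and reward.

```python
# Minimal PyTorch sketch (our reconstruction, not the authors' released code) of
# the described ensemble probabilistic dynamics model.
import torch
import torch.nn as nn

class ProbabilisticDynamics(nn.Module):
    """One ensemble member: 4 FC layers of 400 units, Gaussian output head."""
    def __init__(self, state_dim, action_dim, hidden=400, n_layers=4):
        super().__init__()
        layers, in_dim = [], state_dim + action_dim
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, hidden), nn.ReLU()]
            in_dim = hidden
        self.backbone = nn.Sequential(*layers)
        # Mean and log-variance for (next state, reward): diagonal covariance.
        self.head = nn.Linear(hidden, 2 * (state_dim + 1))

    def forward(self, state, action):
        h = self.backbone(torch.cat([state, action], dim=-1))
        mean, log_var = self.head(h).chunk(2, dim=-1)
        return mean, log_var

class EnsembleDynamics(nn.Module):
    """Ensemble of 7 probabilistic dynamics networks."""
    def __init__(self, state_dim, action_dim, ensemble_size=7):
        super().__init__()
        self.members = nn.ModuleList(
            ProbabilisticDynamics(state_dim, action_dim) for _ in range(ensemble_size)
        )

    def forward(self, state, action):
        # Each member returns the parameters of N(mu_i(s, a), diag(exp(log_var_i(s, a)))).
        return [member(state, action) for member in self.members]
```

The log-variance parameterization keeps the diagonal covariance positive via `exp(log_var)`. Whether the paper counts the output layer among the 4 fully connected layers is not specified, so this is one plausible reading of the architecture.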