Relative Policy-Transition Optimization for Fast Policy Transfer

Authors: Jiawei Xu, Cheng Zhou, Yizheng Zhang, Baoxiang Wang, Lei Han

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of RPTO on a set of MuJoCo continuous control tasks by creating policy transfer problems via variant dynamics. In this section, we perform comprehensive experiments across diverse control tasks to assess the efficacy of our RPTO. Additionally, we analyze how RPO, RTO, and RPTO individually contribute to learning.
Researcher Affiliation | Collaboration | Jiawei Xu (1,2)*, Cheng Zhou (1), Yizheng Zhang (1), Baoxiang Wang (2), Lei Han (1)*. (1) Tencent Robotics X; (2) The Chinese University of Hong Kong, Shenzhen. Emails: jiaweixu1@link.cuhk.edu.cn, {mikechzhou,yizhenzhang}@tencent.com, bxiangwang@cuhk.edu.cn, lxhan@tencent.com
Pseudocode | Yes | Algorithm 1: Relative Policy-Transition Optimization.
Open Source Code | No | The paper does not provide concrete access to its own source code (a specific repository link, an explicit code release statement, or code in supplementary materials) for the methodology described.
Open Datasets | Yes | We experiment on a set of MuJoCo continuous control tasks with the standard neural network (NN) based probabilistic dynamics model, including Ant, Hopper, HalfCheetah, Walker2d and Swimmer. (A hedged sketch of constructing such variant-dynamics transfer tasks follows the table.)
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions software environments like MuJoCo and OpenAI Gym but does not provide specific version numbers for these or any other ancillary software dependencies, such as programming languages or deep learning frameworks.
Experiment Setup | Yes | We set the ensemble size to 7 and each ensemble network has 4 fully connected layers with 400 units. Each head of the dynamics model is a probabilistic neural network that outputs a Gaussian distribution with diagonal covariance: $p^i_\phi(s_{t+1}, r_t \mid s_t, a_t) = \mathcal{N}\big(\mu^i_\phi(s_t, a_t), \Sigma^i_\phi(s_t, a_t)\big)$. We set the model horizon to 1 and the replay ratio of dynamics to 1 for all environments. Other implementation details are provided in Appendix G. (A hedged sketch of such an ensemble model follows the table.)
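
The Experiment Setup row pins down the dynamics-model architecture but not its code. Below is a minimal sketch of such an ensemble of probabilistic networks, assuming a PyTorch implementation; the class names, the SiLU activation, the log-variance clamp, and the reading of "4 fully connected layers with 400 units" as four hidden layers plus separate output heads are all illustrative assumptions, not details taken from the paper (its specifics are in Appendix G).

```python
# Hedged sketch of the ensemble dynamics model described in the
# Experiment Setup row: 7 probabilistic heads, each an MLP with
# 400-unit hidden layers, predicting a diagonal Gaussian over
# the next state and the scalar reward (s_{t+1}, r_t).
import torch
import torch.nn as nn


class ProbabilisticHead(nn.Module):
    """One ensemble member: MLP outputting a Gaussian mean and log-variance."""

    def __init__(self, state_dim, action_dim, hidden=400):
        super().__init__()
        out_dim = state_dim + 1  # next state plus scalar reward
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
        )
        self.mean = nn.Linear(hidden, out_dim)
        self.log_var = nn.Linear(hidden, out_dim)

    def forward(self, state, action):
        h = self.backbone(torch.cat([state, action], dim=-1))
        # Clamping the log-variance is a common stabilization trick,
        # not a detail confirmed by the paper.
        return self.mean(h), self.log_var(h).clamp(-10.0, 4.0)


class EnsembleDynamicsModel(nn.Module):
    """Ensemble of 7 probabilistic heads, matching the stated ensemble size."""

    def __init__(self, state_dim, action_dim, ensemble_size=7):
        super().__init__()
        self.heads = nn.ModuleList(
            ProbabilisticHead(state_dim, action_dim) for _ in range(ensemble_size)
        )

    def nll_loss(self, state, action, target):
        """Gaussian negative log-likelihood (up to constants), summed over heads.

        `target` is the concatenation of next state and reward,
        e.g. torch.cat([next_state, reward], dim=-1).
        """
        loss = 0.0
        for head in self.heads:
            mean, log_var = head(state, action)
            inv_var = torch.exp(-log_var)
            loss = loss + (((target - mean) ** 2) * inv_var + log_var).mean()
        return loss
```

Given the stated model horizon of 1, each head would be trained on single-step (s_t, a_t) → (s_{t+1}, r_t) transitions; how the dynamics replay ratio of 1 enters the training loop is an implementation detail the paper defers to its Appendix G.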
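
For the "policy transfer problems via variant dynamics" mentioned in the Research Type and Open Datasets rows, here is a minimal sketch of one common way to construct such a problem, assuming the mujoco-py-based OpenAI Gym tasks named in the table: a target environment reuses a source task but rescales a physical parameter (here, body masses). The environment id and the 1.5 scale factor are illustrative assumptions, and the paper's actual task variants may differ.

```python
# Hedged sketch: build a target environment with shifted dynamics by
# rescaling the body masses of a standard Gym MuJoCo task.
import gym


def make_variant_env(env_id="Hopper-v2", mass_scale=1.5):
    """Create a MuJoCo env whose body masses differ from the source task."""
    env = gym.make(env_id)
    model = env.unwrapped.model  # underlying mujoco-py model
    model.body_mass[:] = model.body_mass * mass_scale
    return env


source_env = gym.make("Hopper-v2")             # dynamics the policy starts from
target_env = make_variant_env(mass_scale=1.5)  # shifted dynamics to transfer to
```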