MapGo: Model-Assisted Policy Optimization for Goal-Oriented Tasks

Authors: Menghui Zhu, Minghuan Liu, Jian Shen, Zhicheng Zhang, Sheng Chen, Weinan Zhang, Deheng Ye, Yong Yu, Qiang Fu, Wei Yang

IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we first show the effectiveness of the FGI strategy compared with the hindsight one, and then show that the MapGo framework achieves higher sample efficiency when compared to model-free baselines on a set of complicated tasks. We conduct extensive experiments on a set of continuous control tasks. Specifically, we first compare the relabeled goals of FGI with HER [Andrychowicz et al., 2017] to verify the efficacy of FGI. Then we evaluate the MapGo framework on complicated goal-oriented learning benchmarks, indicating the higher sample efficiency compared with former model-free algorithms. Finally, we make a comprehensive ablation study to analyze the performance improvement of MapGo while also examining its limitation.
Researcher Affiliation | Collaboration | Menghui Zhu (1,2), Minghuan Liu (1), Jian Shen (1), Zhicheng Zhang (1), Sheng Chen (2), Weinan Zhang (1), Deheng Ye (2), Yong Yu (1), Qiang Fu (2) and Wei Yang (2). Affiliations: (1) Shanghai Jiao Tong University, Shanghai, China; (2) Tencent AI Lab, Shenzhen, China.
Pseudocode | Yes | Algorithm 1: Model-Assisted Policy Optimization (MapGo)
Input: policy parameters θ, Q-value parameters ω, dynamics model parameters ψ, and environment dataset D_env
for i = 1 to K do
    (s_0, g) ← GoalGen(D_env, T, π_θ)
    for t = 0 to h − 1 do
        a_t ← π_θ(s_t, g)
        s_{t+1} ∼ M(·|s_t, a_t)
        r_t ← r_g(s_t, a_t, s_{t+1})
    end for
    τ ← {s_0, a_0, r_0, s_1, a_1, r_1, ...}
    D_env ← D_env ∪ {τ}
    Update M_ψ according to Eq. (5)
    D_real ← FGI(M_ψ, D_env, π, φ, r)
    π_θ, Q_ω ← UMPO(π_θ, Q_ω, D_env, D_real, M_ψ)
end for
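To make the quoted pseudocode easier to follow, below is a minimal Python sketch of the same outer loop. All helper interfaces (goal_gen.sample, env.transition, env.goal_reward, model.fit, fgi_relabel, umpo_update) are hypothetical stand-ins for components the paper defines; this is a sketch of Algorithm 1's control flow under those assumptions, not the authors' released implementation.

```python
# Minimal sketch of the MapGo training loop (Algorithm 1). All helper
# objects used here (goal_gen, env, policy, q_fn, model, fgi_relabel,
# umpo_update) are hypothetical stand-ins for components defined in the
# paper and its released code; this is not the authors' implementation.

def train_mapgo(env, policy, q_fn, model, goal_gen, fgi_relabel, umpo_update,
                num_iterations=1000, horizon=30):
    d_env = []  # buffer of real trajectories, D_env
    for _ in range(num_iterations):
        # Sample an initial state and a training goal (GoalGen step).
        s, g = goal_gen.sample(d_env, policy)
        trajectory = []
        for _ in range(horizon):
            a = policy.act(s, g)                       # a_t <- pi_theta(s_t, g)
            s_next = env.transition(s, a)              # s_{t+1} ~ M(.|s_t, a_t)
            r = env.goal_reward(s, a, s_next, g)       # r_t <- r_g(s_t, a_t, s_{t+1})
            trajectory.append((s, a, r, s_next, g))
            s = s_next
        d_env.append(trajectory)                       # D_env <- D_env U {tau}

        model.fit(d_env)                               # update M_psi (Eq. 5 in the paper)
        d_relabel = fgi_relabel(model, d_env, policy)  # FGI goal relabeling with the learned model
        policy, q_fn = umpo_update(policy, q_fn, d_env, d_relabel, model)
    return policy, q_fn
```

As in the quoted pseudocode, both the real dataset and the FGI-relabeled dataset are passed to the UMPO update in the final step of each iteration.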
Open Source Code | Yes | Our code is available at https://github.com/apexrl/MapGo.
Open Datasets | Yes | We conduct the comparison experiments on 2D-World, a simple 2D navigation task... In this section, we aim to show the advantage of MapGo compared with previous methods on four more challenging goal-oriented tasks: Reacher, Half Cheetah, Fixed Ant Locomotion, and Diverse Ant Locomotion. In Reacher, we control a 2D robot to reach a randomly located target [Charlesworth and Montana, 2020]. Fixed Ant Locomotion and Diverse Ant Locomotion are similar to the environment in [Florensa et al., 2018], where the agent is asked to move to a target position. Half Cheetah resembles the environment in [Finn et al., 2017] and requires the agent to run while keeping a targeted speed until the end of the episodes.
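As a rough illustration of what such a goal-oriented task looks like, here is a hypothetical minimal 2D goal-reaching environment in the spirit of 2D-World; the class name, distance threshold, action bounds, and sparse reward are all illustrative assumptions and do not reproduce the paper's environments.

```python
# Hypothetical minimal 2D goal-reaching environment, illustrating the
# structure of tasks like 2D-World (not the authors' implementation).
import numpy as np

class Simple2DGoalEnv:
    def __init__(self, threshold=0.1, max_steps=30):
        self.threshold = threshold    # distance at which the goal counts as reached (assumed)
        self.max_steps = max_steps
        self.reset()

    def reset(self):
        self.state = np.zeros(2)
        self.goal = np.random.uniform(-1.0, 1.0, size=2)  # randomly located target
        self.t = 0
        return self.state.copy(), self.goal.copy()

    def step(self, action):
        # Clip the 2D velocity command and integrate the position.
        self.state = self.state + np.clip(action, -0.1, 0.1)
        self.t += 1
        dist = np.linalg.norm(self.state - self.goal)
        reached = dist < self.threshold
        # Sparse goal-reaching reward (an assumption; reward details are not quoted above).
        reward = 0.0 if reached else -1.0
        done = reached or self.t >= self.max_steps
        return self.state.copy(), reward, done, {"is_success": reached}
```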
Dataset Splits | No | The paper does not explicitly provide details about training/validation/test dataset splits, specific percentages, or how data was partitioned for reproduction. It only mentions using '5 different random seeds' for experiments.
Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., GPU models, CPU types, memory) used to conduct the experiments.
Software Dependencies | No | The paper mentions using 'DDPG [Lillicrap et al., 2015] as the learning algorithm for all the methods' but does not specify any software versions for libraries, frameworks, or environments (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | In Reacher, the rollout maximum length is 50, and in the others we set it as 30. We choose the maximum rollout length H as 20 and do not use extra relabeling methods from HER. We utilize DDPG [Lillicrap et al., 2015] as the learning algorithm for all the methods.
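For quick reference, the quoted setup can be collected into a small configuration sketch; the dictionary keys below are hypothetical, and only the values themselves (rollout length 50 for Reacher and 30 for the other tasks, maximum model-rollout length H = 20, no extra HER relabeling, DDPG as the base learner) come from the paper.

```python
# Hypothetical configuration summarizing the quoted experiment setup;
# key names are illustrative, the values are taken from the paper's text.
MAPGO_EXPERIMENT_SETUP = {
    "episode_rollout_length": {        # "rollout maximum length" per task
        "Reacher": 50,
        "HalfCheetah": 30,
        "FixedAntLocomotion": 30,
        "DiverseAntLocomotion": 30,
    },
    "max_model_rollout_length_H": 20,  # maximum rollout length H
    "use_extra_her_relabeling": False, # no extra relabeling methods from HER
    "base_algorithm": "DDPG",          # learning algorithm for all methods
}
```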