MapGo: Model-Assisted Policy Optimization for Goal-Oriented Tasks
Authors: Menghui Zhu, Minghuan Liu, Jian Shen, Zhicheng Zhang, Sheng Chen, Weinan Zhang, Deheng Ye, Yong Yu, Qiang Fu, Wei Yang
IJCAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we first show the effectiveness of the FGI strategy compared with the hindsight one, and then show that the MapGo framework achieves higher sample efficiency when compared to model-free baselines on a set of complicated tasks. We conduct extensive experiments on a set of continuous control tasks. Specifically, we first compare the relabeled goals of FGI with HER [Andrychowicz et al., 2017] to verify the efficacy of FGI. Then we evaluate the MapGo framework on complicated goal-oriented learning benchmarks, indicating the higher sample efficiency compared with former model-free algorithms. Finally, we make a comprehensive ablation study to analyze the performance improvement of MapGo while also examining its limitation. |
| Researcher Affiliation | Collaboration | Menghui Zhu (1,2), Minghuan Liu (1), Jian Shen (1), Zhicheng Zhang (1), Sheng Chen (2), Weinan Zhang (1), Deheng Ye (2), Yong Yu (1), Qiang Fu (2) and Wei Yang (2); (1) Shanghai Jiao Tong University, Shanghai, China; (2) Tencent AI Lab, Shenzhen, China |
| Pseudocode | Yes | Algorithm 1 Model-Assisted Policy Optimization (MapGo). Input: policy parameters θ, Q-value parameters ω, dynamics model parameters ψ, environment buffer Denv. for i = 1 to K do: (s0, g) ← GoalGen(Denv, T, πθ); for t = 0 to h−1 do: at ← πθ(st, g); st+1 ∼ M(·|st, at); rt ← rg(st, at, st+1); end for; τ ← {s0, a0, r0, s1, a1, r1, ...}; Denv ← Denv ∪ {τ}; update Mψ according to Eq. (5); Dreal ← FGI(Mψ, Denv, π, φ, r); πθ, Qω ← UMPO(πθ, Qω, Denv, Dreal, Mψ); end for. (A hedged Python sketch of this loop appears after the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/apexrl/MapGo. |
| Open Datasets | Yes | We conduct the comparison experiments on 2D-World, a simple 2D navigation task... In this section, we aim to show the advantage of Map Go compared with previous methods on four more challenging goal-oriented tasks: Reacher, Half Cheetah, Fixed Ant Locomotion, and Diverse Ant Locomotion. In Reacher, we control a 2D robot to reach a randomly located target [Charlesworth and Montana, 2020]. Fixed Ant Locomotion and Diverse Ant Locomotion are similar to the environment in [Florensa et al., 2018], where the agent is asked to move to a target position. Half Cheetah resembles the environment in [Finn et al., 2017] and requires the agent to run while keeping a targeted speed until the end of the episodes. |
| Dataset Splits | No | The paper does not explicitly provide details about training/validation/test dataset splits, specific percentages, or how data was partitioned for reproduction. It only mentions using '5 different random seeds' for experiments. |
| Hardware Specification | No | The paper does not provide any specific details regarding the hardware (e.g., GPU models, CPU types, memory) used to conduct the experiments. |
| Software Dependencies | No | The paper mentions using 'DDPG [Lillicrap et al., 2015] as the learning algorithm for all the methods' but does not specify any software versions for libraries, frameworks, or environments (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | In Reacher, the rollout maximum length is 50 and in the others, we set it as 30. We choose the maximum rollout length H as 20 and do not use extra relabeling methods from HER. We utilize DDPG [Lillicrap et al., 2015] as the learning algorithm for all the methods. |
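To make the extracted pseudocode in the Pseudocode row easier to follow, here is a minimal Python sketch of one possible reading of the MapGo outer loop. It is a sketch under assumptions: every name below (`env_step`, `goal_gen`, `fgi`, `umpo`, `reward_fn`, and so on) is a hypothetical stand-in, not the authors' released API; only the loop structure follows Algorithm 1 as quoted.

```python
# Minimal sketch of Algorithm 1 (MapGo) as quoted in the Pseudocode row.
# Every callable below is a hypothetical stand-in, NOT the authors' code or API.

def mapgo(policy, q_func, dynamics_model, goal_gen, fgi, umpo,
          env_step, reward_fn, num_iters=100, horizon=30):
    """One possible reading of the MapGo outer loop."""
    env_buffer = []                                # D_env: real-environment trajectories
    for _ in range(num_iters):                     # for i = 1 to K
        s, g = goal_gen(env_buffer, policy)        # (s0, g) <- GoalGen(D_env, T, pi_theta)

        trajectory = []
        for _ in range(horizon):                   # for t = 0 to h-1
            a = policy(s, g)                       # a_t <- pi_theta(s_t, g)
            s_next = env_step(s, a)                # s_{t+1} ~ M(.|s_t, a_t): true dynamics
            r = reward_fn(s, a, s_next, g)         # r_t <- r_g(s_t, a_t, s_{t+1})
            trajectory.append((s, a, r, s_next))
            s = s_next
        env_buffer.append(trajectory)              # D_env <- D_env U {tau}

        dynamics_model.update(env_buffer)          # fit M_psi (Eq. (5) in the paper)

        # Foresight Goal Inference: relabel goals using the learned model M_psi.
        relabeled = fgi(dynamics_model, env_buffer, policy, reward_fn)

        # Model-assisted policy/value update (UMPO in the paper).
        policy, q_func = umpo(policy, q_func, env_buffer, relabeled, dynamics_model)
    return policy, q_func
```

The loop collects real trajectories, refits the dynamics model, relabels goals via FGI, and then runs the model-assisted update, matching the order of operations in the quoted algorithm.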
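For quick reference, the numbers quoted in the Experiment Setup row can also be read as a small settings sketch. The dictionary below is my own framing of those quoted values; the key names, and the interpretation that H is the model rollout length, are assumptions rather than the authors' configuration files.

```python
# Rough per-task settings as quoted in the Experiment Setup row.
# Layout and key names are assumptions; the values come from the quoted text.
EPISODE_LENGTH = {
    "Reacher": 50,                  # "the rollout maximum length is 50"
    "HalfCheetah": 30,              # "in the others, we set it as 30"
    "FixedAntLocomotion": 30,
    "DiverseAntLocomotion": 30,
}
MODEL_ROLLOUT_LENGTH_H = 20         # "maximum rollout length H as 20" (interpretation assumed)
BASE_LEARNER = "DDPG"               # Lillicrap et al., 2015; used for all methods
EXTRA_HER_RELABELING = False        # "do not use extra relabeling methods from HER"
```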