Policy Optimization with Model-Based Explorations
Authors: Feiyang Pan, Qingpeng Cai, An-Xiang Zeng, Chun-Xiang Pan, Qing Da, Hualin He, Qing He, Pingzhong Tang (pp. 4675-4682)
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We verify POME on the Atari 2600 game playing benchmarks. We compare our POME with the original model-free PPO algorithm and a model-based extended version of PPO. Experimental results show that POME outperforms the original PPO on 33 Atari games out of 49. |
| Researcher Affiliation | Collaboration | (1) Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China. {panfeiyang, heqing}@ict.ac.cn (2) University of Chinese Academy of Sciences, Beijing 100049, China. (3) IIIS, Tsinghua University. cqp14@mails.tsinghua.edu.cn, kenshinping@gmail.com (4) Alibaba Group. {renzhong, xuanran}@taobao.com, {daqing.dq, hualin.hhl}@alibaba-inc.com |
| Pseudocode | Yes | Algorithm 1 Policy Optimization with Model-based Explorations (single worker) |
| Open Source Code | No | The paper mentions using 'a standard open-sourced PPO implementation (Dhariwal et al. 2017)' with a GitHub link provided for it, but does not state that the code for their proposed method (POME) is open-source or provide a link to it. |
| Open Datasets | Yes | We verify POME on the Atari 2600 game playing benchmarks. We use the Arcade Learning Environment (Bellemare et al. 2013) benchmarks along with a standard open-sourced PPO implementation (Dhariwal et al. 2017). |
| Dataset Splits | No | The paper mentions 'training for 10M timesteps' and discusses 'minibatch size' and 'rollout trajectory', but does not provide specific training/validation/test dataset splits (percentages, counts, or references to predefined splits). |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions 'a standard open-sourced PPO implementation (Dhariwal et al. 2017)' and 'Adam gradient descent optimizer (Kingma and Ba 2014)', but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | For all the algorithms, the discount factor γ is set to 0.99 and the advantage values are estimated by the k-step error (Eq. 16), where the horizon k is set to be 128. We use 8 actors (workers) to simultaneously run the algorithm and the minibatch size is 128 * 8. ... L = L^POME(θ) + c_v L^V(φ) + c_T L^T(θ^T), which is optimized by the Adam gradient descent optimizer (Kingma and Ba 2014) with learning rate 2.5e-4 · f, where f is a fraction linearly annealed from 1 to 0 over the course of learning, and c_v, c_T are coefficients for tuning the learning rate of the value function and the transition function. In our experiments we set these coefficients to c_v = 1 and c_T = 2. |
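
The quoted setup pins down the loss composition, coefficients, minibatch size, and learning-rate schedule, but not the network code itself. Below is a minimal, hypothetical PyTorch-style sketch of that optimization setup. The placeholder modules and the three loss terms are stand-ins (the paper's actual policy, value, and transition-model definitions are not quoted in the table); only the constants and the linearly annealed Adam learning rate follow the description above.

```python
# Hypothetical sketch of the quoted optimization setup; not the authors' code.
import torch

GAMMA = 0.99            # discount factor (enters the k-step advantage, unused in this sketch)
HORIZON_K = 128         # k-step horizon for advantage estimation
NUM_WORKERS = 8         # parallel actors
MINIBATCH = HORIZON_K * NUM_WORKERS   # 128 * 8, as stated
C_V, C_T = 1.0, 2.0     # coefficients c_v and c_T
BASE_LR = 2.5e-4
TOTAL_TIMESTEPS = 10_000_000          # "training for 10M timesteps"
TOTAL_UPDATES = TOTAL_TIMESTEPS // MINIBATCH


def lr_fraction(update_idx: int) -> float:
    """Fraction f, linearly annealed from 1 to 0 over the course of learning."""
    return max(0.0, 1.0 - update_idx / TOTAL_UPDATES)


# Placeholder modules: the paper's actual architectures are not quoted here.
policy_value_net = torch.nn.Linear(4, 2)
transition_net = torch.nn.Linear(4, 4)

optimizer = torch.optim.Adam(
    list(policy_value_net.parameters()) + list(transition_net.parameters()),
    lr=BASE_LR,
)


def training_step(update_idx, l_pome, l_value, l_transition):
    """One update of L = L^POME + c_v * L^V + c_T * L^T with the annealed rate."""
    for group in optimizer.param_groups:
        group["lr"] = BASE_LR * lr_fraction(update_idx)
    loss = l_pome + C_V * l_value + C_T * l_transition
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


# Dummy invocation with stand-in losses derived from the placeholder modules.
x = torch.randn(MINIBATCH, 4)
l_pome = -policy_value_net(x).mean()             # stand-in for the POME policy loss
l_value = policy_value_net(x).pow(2).mean()      # stand-in for the value loss L^V
l_trans = (transition_net(x) - x).pow(2).mean()  # stand-in for the transition loss L^T
training_step(update_idx=0, l_pome=l_pome, l_value=l_value, l_transition=l_trans)
```

Rewriting the optimizer's `param_groups` each update is one common way to realize a linear learning-rate schedule; the Baselines PPO implementation the paper builds on likewise recomputes the annealed rate once per update.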