QUOTA: The Quantile Option Architecture for Reinforcement Learning
Authors: Shangtong Zhang, Hengshuai Yao
AAAI 2019, pp. 5797-5804 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the performance advantage of QUOTA in both challenging video games and physical robot simulators. |
| Researcher Affiliation | Collaboration | Shangtong Zhang (Department of Computing Science, University of Alberta); Hengshuai Yao (Reinforcement Learning for Autonomous Driving Lab, Noah's Ark Lab, Huawei). shangtong.zhang@ualberta.ca, hengshuai.yao@huawei.com |
| Pseudocode | Yes | The pseudo code of QUOTA is provided in Supplementary Material. |
| Open Source Code | Yes | All the implementations are made publicly available: https://github.com/ShangtongZhang/DeepRL |
| Open Datasets | Yes | We evaluated QUOTA in both Arcade Learning Environment (ALE) (Bellemare et al. 2013) and Roboschool. |
| Dataset Splits | No | The paper mentions training steps and performance evaluation, but it does not describe explicit train/validation/test dataset splits. |
| Hardware Specification | No | The paper does not provide specific hardware details, such as the GPU/CPU models used to run the experiments; it only mentions that experiments were run. |
| Software Dependencies | No | The paper mentions using an RMSProp optimizer and the Huber loss, but it does not specify version numbers for any software dependencies, libraries, or frameworks used (e.g., TensorFlow or PyTorch); a sketch of the quantile Huber loss follows at the end of this section. |
| Experiment Setup | Yes | We used 16 synchronous workers, and the rollout length is 5, resulting in a batch size of 80. We trained each agent for 40M steps with frameskip 4, resulting in 160M frames in total. We used an RMSProp optimizer with an initial learning rate of 10⁻⁴. The discount factor is 0.99. The ϵ for action selection was linearly decayed from 1.0 to 0.05 in the first 4M training steps and remained 0.05 afterwards. We used 200 quantiles to approximate the distribution and set the Huber loss parameter κ to 1. We used 10 options in QUOTA (M = 10), and ϵ_Ω was linearly decayed from 1.0 to 0 during the 40M training steps. β was fixed at 0.01. These settings are collected in the configuration sketch just below this table. |
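
For convenience, the hyperparameters quoted in the Experiment Setup row can be gathered into a single configuration, with the two ϵ schedules written as linear anneals. The sketch below is our own reconstruction in Python under those reported values; the names (`CONFIG`, `linear_decay`) and the step-indexed interface are assumptions, not taken from the authors' repository.

```python
# Hedged sketch: the paper's reported hyperparameters gathered in one place.
# Names and structure are illustrative, not taken from the authors' code.
CONFIG = {
    "num_workers": 16,          # synchronous workers
    "rollout_length": 5,        # => batch size 16 * 5 = 80
    "max_steps": 40_000_000,    # 40M agent steps (160M frames at frameskip 4)
    "frameskip": 4,
    "learning_rate": 1e-4,      # RMSProp initial learning rate
    "discount": 0.99,
    "num_quantiles": 200,
    "huber_kappa": 1.0,
    "num_options": 10,          # M in the paper
    "beta": 0.01,
}

def linear_decay(step, start, end, decay_steps):
    """Linearly anneal from `start` to `end` over `decay_steps`, then hold."""
    frac = min(step / decay_steps, 1.0)
    return start + frac * (end - start)

# Action-selection ϵ: 1.0 -> 0.05 over the first 4M steps, then fixed.
eps = lambda step: linear_decay(step, 1.0, 0.05, 4_000_000)
# Option-selection ϵ_Ω: 1.0 -> 0 over the full 40M training steps.
eps_omega = lambda step: linear_decay(step, 1.0, 0.0, 40_000_000)
```

At step 0 both schedules return 1.0; by step 4M the action-selection ϵ has reached its floor of 0.05, while ϵ_Ω keeps decaying until the end of training.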
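
The table's distributional settings (200 quantiles, Huber parameter κ = 1) match the quantile Huber loss of QR-DQN (Dabney et al. 2018), on which QUOTA builds. The following PyTorch sketch implements that standard loss under those settings; it is a reconstruction from the published formulation, not the authors' implementation.

```python
import torch

def quantile_huber_loss(pred, target, kappa=1.0):
    """Quantile Huber loss in the style of QR-DQN (Dabney et al. 2018).

    pred:   (batch, N) predicted quantile values
    target: (batch, N) target quantile samples (already detached)
    """
    n = pred.shape[1]
    # Quantile midpoints tau_hat_i = (2i + 1) / (2N), one per predicted quantile.
    tau = (torch.arange(n, dtype=pred.dtype, device=pred.device) + 0.5) / n
    # Pairwise TD errors u[b, i, j] = target[b, j] - pred[b, i].
    u = target.unsqueeze(1) - pred.unsqueeze(2)
    # Huber loss with threshold kappa (kappa = 1 in the paper).
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))
    # Asymmetric quantile weight |tau_i - 1{u < 0}|.
    weight = (tau.view(1, -1, 1) - (u.detach() < 0).float()).abs()
    # Sum over predicted quantiles i, average over target samples j and batch.
    return (weight * huber).sum(dim=1).mean()
```

Some write-ups divide the Huber term by κ; with κ = 1, as reported here, the two variants coincide.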