KoGuN: Accelerating Deep Reinforcement Learning via Integrating Human Suboptimal Knowledge
Authors: Peng Zhang, Jianye Hao, Weixun Wang, Hongyao Tang, Yi Ma, Yihai Duan, Yan Zheng
IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on both discrete and continuous control tasks. The empirical results show that our approach, which combines human suboptimal knowledge and RL, achieves significant improvement on learning efficiency of flat RL algorithms, even with very low-performance human prior knowledge. |
| Researcher Affiliation | Collaboration | (1) College of Intelligence and Computing, Tianjin University; (2) Noah's Ark Lab, Huawei; (3) Tianjin Key Lab of Machine Learning |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements or links indicating the availability of open-source code for the described methodology. |
| Open Datasets | Yes | Cart Pole [Barto and Sutton, 1982], Lunar Lander and Lunar Lander Continuous in OpenAI Gym [Brockman et al., 2016], and Flappy Bird in PLE [Tasfi, 2016] (see the environment-setup sketch after the table). |
| Dataset Splits | No | The paper refers to general MDP components like state, action spaces and policies, and mentions 'train', 'validation', and 'test' in the context of general machine learning concepts, but does not provide specific train/validation/test dataset split information for its own experiments. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU/CPU models, memory, or specific computing environments used for running the experiments. |
| Software Dependencies | No | The paper mentions using Adam optimizer and PPO, but it does not specify version numbers for any software dependencies or libraries required to replicate the experiments. |
| Experiment Setup | Yes | The experimental setup is as follows: for all tasks, we use the Adam optimizer [Kingma and Ba, 2014] with a learning rate of 1×10⁻⁴ and the temperature τ = 0.1. The discount factor γ is set to 0.99 and the GAE λ is set to 0.95. The policy is updated every 128 timesteps. For PPO without KoGuN, we use a neural network with two fully-connected hidden layers as the policy approximator. For KoGuN with a normal network (KoGuN-concat) as the refine module, we also use a neural network with two fully-connected hidden layers for the refine module. For KoGuN with hypernetworks (KoGuN-hyper), we use hypernetworks to generate a refine module with one hidden layer. Each hypernetwork has two hidden layers. All hidden layers described above have 32 units. w1 is set to 0.7 at the beginning and decays to 0.1 by the end of the training phase. (See the configuration sketch after the table.) |
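
The environments listed under Open Datasets are all publicly available. Below is a minimal sketch of how they can be instantiated; the Gym environment IDs (`CartPole-v1`, `LunarLander-v2`, `LunarLanderContinuous-v2`) and the PLE settings are assumptions based on the standard Gym and PLE APIs, since the paper does not state them.

```python
# Minimal sketch: instantiating the benchmark environments named in the paper.
# Environment IDs and PLE settings are assumptions; the paper does not list them.
import gym                               # classic Gym API (pre-0.26 step/reset signature)
from ple import PLE                      # PyGame Learning Environment [Tasfi, 2016]
from ple.games.flappybird import FlappyBird

# OpenAI Gym tasks (discrete and continuous control; LunarLander requires Box2D).
cartpole = gym.make("CartPole-v1")
lunar_lander = gym.make("LunarLander-v2")
lunar_lander_cont = gym.make("LunarLanderContinuous-v2")

# Flappy Bird via PLE.
flappy = PLE(FlappyBird(), fps=30, display_screen=False)
flappy.init()

# One interaction step with the classic Gym interface.
obs = cartpole.reset()
obs, reward, done, info = cartpole.step(cartpole.action_space.sample())
```
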
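Since no open-source code is provided (see the Open Source Code row), the following is only a sketch of the hyperparameters and network sizes reported in the Experiment Setup row. The helper names (`make_policy`, `knowledge_weight`) and the linear decay schedule for w1 are assumptions; the paper only states that w1 starts at 0.7 and decays to 0.1 over training.

```python
# Sketch of the reported hyperparameters and network sizes; not the authors' code.
import torch
import torch.nn as nn

HYPERPARAMS = {
    "learning_rate": 1e-4,   # Adam [Kingma and Ba, 2014]
    "temperature": 0.1,      # tau
    "gamma": 0.99,           # discount factor
    "gae_lambda": 0.95,      # GAE lambda
    "rollout_length": 128,   # policy updated every 128 timesteps
    "hidden_units": 32,      # all hidden layers have 32 units
    "w1_start": 0.7,         # prior-knowledge weight at the start of training
    "w1_end": 0.1,           # prior-knowledge weight at the end of training
}

def make_policy(obs_dim: int, act_dim: int, hidden: int = 32) -> nn.Module:
    """Two fully-connected hidden layers, as reported for the PPO policy
    and the KoGuN-concat refine module."""
    return nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, act_dim),
    )

def knowledge_weight(step: int, total_steps: int) -> float:
    """Assumed linear decay of w1 from 0.7 to 0.1 over the training phase."""
    frac = min(step / total_steps, 1.0)
    return HYPERPARAMS["w1_start"] + frac * (HYPERPARAMS["w1_end"] - HYPERPARAMS["w1_start"])

policy = make_policy(obs_dim=8, act_dim=4)  # e.g. LunarLander-v2 dimensions
optimizer = torch.optim.Adam(policy.parameters(), lr=HYPERPARAMS["learning_rate"])
```
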