KoGuN: Accelerating Deep Reinforcement Learning via Integrating Human Suboptimal Knowledge
Authors: Peng Zhang, Jianye Hao, Weixun Wang, Hongyao Tang, Yi Ma, Yihai Duan, Yan Zheng
IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on both discrete and continuous control tasks. The empirical results show that our approach, which combines human suboptimal knowledge and RL, achieves significant improvement on learning efficiency of flat RL algorithms, even with very low-performance human prior knowledge. |
| Researcher Affiliation | Collaboration | (1) College of Intelligence and Computing, Tianjin University; (2) Noah's Ark Lab, Huawei; (3) Tianjin Key Lab of Machine Learning |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements or links indicating the availability of open-source code for the described methodology. |
| Open Datasets | Yes | Cart Pole [Barto and Sutton, 1982], Lunar Lander and Lunar Lander Continuous in OpenAI Gym [Brockman et al., 2016], and Flappy Bird in PLE [Tasfi, 2016] (see the environment-setup sketch after the table). |
| Dataset Splits | No | The paper refers to general MDP components like state, action spaces and policies, and mentions 'train', 'validation', and 'test' in the context of general machine learning concepts, but does not provide specific train/validation/test dataset split information for its own experiments. |
| Hardware Specification | No | The paper does not specify any hardware details such as GPU/CPU models, memory, or specific computing environments used for running the experiments. |
| Software Dependencies | No | The paper mentions using Adam optimizer and PPO, but it does not specify version numbers for any software dependencies or libraries required to replicate the experiments. |
| Experiment Setup | Yes | The experimental setup is as follows: for all tasks, we use the Adam optimizer [Kingma and Ba, 2014] with a learning rate of 1×10⁻⁴ and the temperature τ = 0.1. The discount factor γ is set to 0.99 and the GAE λ is set to 0.95. The policy is updated every 128 timesteps. For PPO without KoGuN, we use a neural network with two fully-connected hidden layers as the policy approximator. For KoGuN with a normal network (KoGuN-concat) as the refine module, we also use a neural network with two fully-connected hidden layers for the refine module. For KoGuN with hypernetworks (KoGuN-hyper), we use hypernetworks to generate a refine module with one hidden layer. Each hypernetwork has two hidden layers. All hidden layers described above have 32 units. w1 is set to 0.7 at the beginning and decays to 0.1 by the end of the training phase. (See the configuration sketch after the table.) |
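
The environments listed under Open Datasets are all publicly available. Below is a minimal sketch of how they can be instantiated; the Gym environment IDs (`CartPole-v1`, `LunarLander-v2`, `LunarLanderContinuous-v2`) and the PLE settings are assumptions based on the standard Gym and PLE APIs, since the paper does not state them.

```python
# Minimal sketch: instantiating the benchmark environments named in the paper.
# Environment IDs and PLE settings are assumptions; the paper does not list them.
import gym                               # classic Gym API (pre-0.26 step/reset signature)
from ple import PLE                      # PyGame Learning Environment [Tasfi, 2016]
from ple.games.flappybird import FlappyBird

# OpenAI Gym tasks (discrete and continuous control; LunarLander requires Box2D).
cartpole = gym.make("CartPole-v1")
lunar_lander = gym.make("LunarLander-v2")
lunar_lander_cont = gym.make("LunarLanderContinuous-v2")

# Flappy Bird via PLE.
flappy = PLE(FlappyBird(), fps=30, display_screen=False)
flappy.init()

# One interaction step with the classic Gym interface.
obs = cartpole.reset()
obs, reward, done, info = cartpole.step(cartpole.action_space.sample())
```
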
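Since no open-source code is provided (see the Open Source Code row), the following is only a sketch of the hyperparameters and network sizes reported in the Experiment Setup row. The helper names (`make_policy`, `knowledge_weight`) and the linear decay schedule for w1 are assumptions; the paper only states that w1 starts at 0.7 and decays to 0.1 over training.

```python
# Sketch of the reported hyperparameters and network sizes; not the authors' code.
import torch
import torch.nn as nn

HYPERPARAMS = {
    "learning_rate": 1e-4,   # Adam [Kingma and Ba, 2014]
    "temperature": 0.1,      # tau
    "gamma": 0.99,           # discount factor
    "gae_lambda": 0.95,      # GAE lambda
    "rollout_length": 128,   # policy updated every 128 timesteps
    "hidden_units": 32,      # all hidden layers have 32 units
    "w1_start": 0.7,         # prior-knowledge weight at the start of training
    "w1_end": 0.1,           # prior-knowledge weight at the end of training
}

def make_policy(obs_dim: int, act_dim: int, hidden: int = 32) -> nn.Module:
    """Two fully-connected hidden layers, as reported for the PPO policy
    and the KoGuN-concat refine module."""
    return nn.Sequential(
        nn.Linear(obs_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, act_dim),
    )

def knowledge_weight(step: int, total_steps: int) -> float:
    """Assumed linear decay of w1 from 0.7 to 0.1 over the training phase."""
    frac = min(step / total_steps, 1.0)
    return HYPERPARAMS["w1_start"] + frac * (HYPERPARAMS["w1_end"] - HYPERPARAMS["w1_start"])

policy = make_policy(obs_dim=8, act_dim=4)  # e.g. LunarLander-v2 dimensions
optimizer = torch.optim.Adam(policy.parameters(), lr=HYPERPARAMS["learning_rate"])
```
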