Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning

Authors: Wenjie Shi, Shiji Song, Cheng Wu

IJCAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our proposed algorithm, namely DSPG, across several benchmark continuous control tasks and compare it to standard DDPG and SAC implementations. We find that DSPG can consistently match or beat the performance of these baselines.
Researcher Affiliation | Academia | Wenjie Shi, Shiji Song and Cheng Wu, Department of Automation, Tsinghua University, Beijing, China. shiwj16@mails.tsinghua.edu.cn, shijis@mail.tsinghua.edu.cn, wuc@tsinghua.edu.cn
Pseudocode | Yes | Algorithm 1: DSPG Algorithm. The algorithm itself is not reproduced on this page; an illustrative soft policy gradient update sketch is given after this table.
Open Source Code | No | The source code of our DSPG implementation will be available online after the paper is accepted.
Open Datasets | Yes | We chose four well-known benchmark continuous control tasks (Ant, Hopper, HalfCheetah and Walker2d) available from OpenAI Gym and utilizing the MuJoCo environment. (An environment-setup sketch appears after this table.)
Dataset Splits | No | The paper mentions '7 randomly seeded training runs' and evaluating performance through 'evaluation rollouts', but does not specify explicit train/validation/test dataset splits with percentages or counts. The environments (Ant, Hopper, HalfCheetah, Walker2d) are used for training and evaluation without explicitly defined dataset splits.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory specifications) used for running experiments were provided in the paper.
Software Dependencies | No | The paper mentions using 'Adam for learning the neural network parameters' but does not provide specific software dependencies with version numbers for libraries or frameworks such as Python, PyTorch/TensorFlow, etc.
Experiment Setup | Yes | Throughout all experiments, we use Adam for learning the neural network parameters, with learning rates of 5 × 10^-5 and 5 × 10^-4 for the actor and critic respectively. For the critic we use a discount factor of γ = 0.99. For the soft target updates we use α = 0.01. Both the actor and critic are represented by fully connected feed-forward neural networks with two hidden layers of dimension 512, and all hidden layers use ReLU activation. Specifically, we use identity and sigmoid activations for the mean and standard deviation in the output layer respectively. The algorithm uses a replay buffer size of three million and trains with minibatch sizes of 100 for each training step. Training does not start until the replay buffer has enough samples for a minibatch and does not stop until the global time step reaches the threshold of 3 × 10^6. In addition, we scale the reward function by a factor of 5 for all four tasks, as is common in prior works. (These values are collected in the configuration sketch after this table.)
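
For orientation, the following is a minimal, illustrative sketch of a maximum-entropy (soft) policy gradient update of the kind Algorithm 1 describes, written in PyTorch. It is not the authors' DSPG code: the class names (SoftActor, SoftCritic), the observation/action dimensions, and the fixed entropy weight ALPHA_ENT are assumptions; only the layer sizes, activations, learning rates, and soft-update coefficient follow the quoted Experiment Setup.

```python
# Illustrative maximum-entropy (soft) policy gradient update, in the spirit of
# Algorithm 1 (DSPG). This is NOT the authors' code: network names, dimensions,
# and the fixed entropy weight ALPHA_ENT are assumptions for the sketch.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, HIDDEN = 17, 6, 512   # dimensions assumed for illustration

class SoftActor(nn.Module):
    """Gaussian policy: identity activation for the mean, sigmoid for the std."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(OBS_DIM, HIDDEN), nn.ReLU(),
                                  nn.Linear(HIDDEN, HIDDEN), nn.ReLU())
        self.mean = nn.Linear(HIDDEN, ACT_DIM)
        self.std = nn.Sequential(nn.Linear(HIDDEN, ACT_DIM), nn.Sigmoid())

    def forward(self, obs):
        h = self.body(obs)
        return torch.distributions.Normal(self.mean(h), self.std(h) + 1e-6)

class SoftCritic(nn.Module):
    """Soft state-action value network Q(s, a)."""
    def __init__(self):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, HIDDEN), nn.ReLU(),
                               nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
                               nn.Linear(HIDDEN, 1))

    def forward(self, obs, act):
        return self.q(torch.cat([obs, act], dim=-1))

actor, critic, target_critic = SoftActor(), SoftCritic(), SoftCritic()
target_critic.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=5e-5)    # actor lr from the paper
critic_opt = torch.optim.Adam(critic.parameters(), lr=5e-4)  # critic lr from the paper
GAMMA, TAU, ALPHA_ENT = 0.99, 0.01, 0.2   # ALPHA_ENT (entropy weight) is assumed

def update(obs, act, rew, next_obs, done):
    """One gradient step on a minibatch drawn from the replay buffer."""
    # Critic: soft Bellman backup with an entropy bonus on the next action.
    with torch.no_grad():
        next_dist = actor(next_obs)
        next_act = next_dist.sample()
        next_logp = next_dist.log_prob(next_act).sum(-1, keepdim=True)
        target_q = rew + GAMMA * (1.0 - done) * (
            target_critic(next_obs, next_act) - ALPHA_ENT * next_logp)
    critic_loss = ((critic(obs, act) - target_q) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize expected soft value (Q plus policy entropy).
    dist = actor(obs)
    new_act = dist.rsample()
    logp = dist.log_prob(new_act).sum(-1, keepdim=True)
    actor_loss = (ALPHA_ENT * logp - critic(obs, new_act)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft (Polyak) target update with coefficient TAU (alpha = 0.01 in the paper).
    with torch.no_grad():
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.mul_(1.0 - TAU).add_(TAU * p)
```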
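
The evaluation protocol (four MuJoCo tasks from OpenAI Gym, 7 randomly seeded training runs per task) could be scripted roughly as below; the "-v2" environment version suffixes and the train() placeholder are assumptions, not details from the paper.

```python
# Sketch of the evaluation protocol: four OpenAI Gym / MuJoCo tasks, each run
# with 7 random seeds. The "-v2" suffixes and the train() stub are assumptions.
import gym

TASKS = ["Ant-v2", "Hopper-v2", "HalfCheetah-v2", "Walker2d-v2"]
NUM_SEEDS = 7

def train(env, seed):
    """Placeholder for a full DSPG training run (not reproduced here)."""
    raise NotImplementedError

for task in TASKS:
    for seed in range(NUM_SEEDS):
        env = gym.make(task)
        env.seed(seed)       # older Gym API; newer versions use env.reset(seed=seed)
        # train(env, seed)   # one of the 7 seeded training runs for this task
```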
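
For quick reference, the quoted Experiment Setup values can be gathered into a single configuration; the key names below are my own, while the values are those stated above.

```python
# Hyperparameters quoted in the Experiment Setup row, gathered in one place.
# Key names are my own; the values are as reported in the quoted text.
DSPG_CONFIG = {
    "optimizer": "Adam",
    "actor_lr": 5e-5,
    "critic_lr": 5e-4,
    "discount_gamma": 0.99,
    "soft_target_update": 0.01,      # called alpha in the quoted text
    "hidden_layers": (512, 512),     # fully connected, ReLU activations
    "mean_activation": "identity",
    "std_activation": "sigmoid",
    "replay_buffer_size": 3_000_000,
    "minibatch_size": 100,
    "max_env_steps": 3_000_000,      # 3 x 10^6 global time steps
    "reward_scale": 5.0,             # applied to all four tasks
}
```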