Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning

Authors: Wenjie Shi, Shiji Song, Cheng Wu

IJCAI 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our proposed algorithm, namely DSPG, across several benchmark continuous control tasks and compare it to standard DDPG and SAC implementations. We find that DSPG can consistently match or beat the performance of these baselines.
Researcher Affiliation | Academia | Wenjie Shi, Shiji Song and Cheng Wu, Department of Automation, Tsinghua University, Beijing, China. shiwj16@mails.tsinghua.edu.cn, shijis@mail.tsinghua.edu.cn, wuc@tsinghua.edu.cn
Pseudocode | Yes | Algorithm 1: DSPG Algorithm. The algorithm itself is not reproduced on this page; an illustrative soft policy gradient update sketch is given after this table.
Open Source Code | No | The source code of our DSPG implementation will be available online after the paper is accepted.
Open Datasets | Yes | We chose four well-known benchmark continuous control tasks (Ant, Hopper, HalfCheetah and Walker2d) available from OpenAI Gym and utilizing the MuJoCo environment. (An environment-setup sketch appears after this table.)
Dataset Splits | No | The paper mentions '7 randomly seeded training runs' and evaluating performance through 'evaluation rollouts', but does not specify explicit train/validation/test dataset splits with percentages or counts. The environments (Ant, Hopper, HalfCheetah, Walker2d) are used for training and evaluation without explicitly defined dataset splits.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory specifications) used for running experiments were provided in the paper.
Software Dependencies | No | The paper mentions using 'Adam for learning the neural network parameters' but does not provide specific software dependencies with version numbers for libraries or frameworks such as Python, PyTorch/TensorFlow, etc.
Experiment Setup | Yes | Throughout all experiments, we use Adam for learning the neural network parameters, with learning rates of 5 × 10^-5 and 5 × 10^-4 for the actor and critic respectively. For the critic we use a discount factor of γ = 0.99. For the soft target updates we use α = 0.01. Both the actor and critic are represented by fully connected feed-forward neural networks with two hidden layers of dimension 512, and all hidden layers use ReLU activation. Specifically, we use identity and sigmoid activations for the mean and standard deviation in the output layer respectively. The algorithm uses a replay buffer size of three million and trains with minibatch sizes of 100 for each training step. Training does not start until the replay buffer has enough samples for a minibatch and does not stop until the global time step reaches the threshold of 3 × 10^6. In addition, we scale the reward function by a factor of 5 for all four tasks, as is common in prior works. (These values are collected in the configuration sketch after this table.)
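
For orientation, the following is a minimal, illustrative sketch of a maximum-entropy (soft) policy gradient update of the kind Algorithm 1 describes, written in PyTorch. It is not the authors' DSPG code: the class names (SoftActor, SoftCritic), the observation/action dimensions, and the fixed entropy weight ALPHA_ENT are assumptions; only the layer sizes, activations, learning rates, and soft-update coefficient follow the quoted Experiment Setup.

```python
# Illustrative maximum-entropy (soft) policy gradient update, in the spirit of
# Algorithm 1 (DSPG). This is NOT the authors' code: network names, dimensions,
# and the fixed entropy weight ALPHA_ENT are assumptions for the sketch.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, HIDDEN = 17, 6, 512   # dimensions assumed for illustration

class SoftActor(nn.Module):
    """Gaussian policy: identity activation for the mean, sigmoid for the std."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(OBS_DIM, HIDDEN), nn.ReLU(),
                                  nn.Linear(HIDDEN, HIDDEN), nn.ReLU())
        self.mean = nn.Linear(HIDDEN, ACT_DIM)
        self.std = nn.Sequential(nn.Linear(HIDDEN, ACT_DIM), nn.Sigmoid())

    def forward(self, obs):
        h = self.body(obs)
        return torch.distributions.Normal(self.mean(h), self.std(h) + 1e-6)

class SoftCritic(nn.Module):
    """Soft state-action value network Q(s, a)."""
    def __init__(self):
        super().__init__()
        self.q = nn.Sequential(nn.Linear(OBS_DIM + ACT_DIM, HIDDEN), nn.ReLU(),
                               nn.Linear(HIDDEN, HIDDEN), nn.ReLU(),
                               nn.Linear(HIDDEN, 1))

    def forward(self, obs, act):
        return self.q(torch.cat([obs, act], dim=-1))

actor, critic, target_critic = SoftActor(), SoftCritic(), SoftCritic()
target_critic.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=5e-5)    # actor lr from the paper
critic_opt = torch.optim.Adam(critic.parameters(), lr=5e-4)  # critic lr from the paper
GAMMA, TAU, ALPHA_ENT = 0.99, 0.01, 0.2   # ALPHA_ENT (entropy weight) is assumed

def update(obs, act, rew, next_obs, done):
    """One gradient step on a minibatch drawn from the replay buffer."""
    # Critic: soft Bellman backup with an entropy bonus on the next action.
    with torch.no_grad():
        next_dist = actor(next_obs)
        next_act = next_dist.sample()
        next_logp = next_dist.log_prob(next_act).sum(-1, keepdim=True)
        target_q = rew + GAMMA * (1.0 - done) * (
            target_critic(next_obs, next_act) - ALPHA_ENT * next_logp)
    critic_loss = ((critic(obs, act) - target_q) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor: maximize expected soft value (Q plus policy entropy).
    dist = actor(obs)
    new_act = dist.rsample()
    logp = dist.log_prob(new_act).sum(-1, keepdim=True)
    actor_loss = (ALPHA_ENT * logp - critic(obs, new_act)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()

    # Soft (Polyak) target update with coefficient TAU (alpha = 0.01 in the paper).
    with torch.no_grad():
        for p, tp in zip(critic.parameters(), target_critic.parameters()):
            tp.mul_(1.0 - TAU).add_(TAU * p)
```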
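
The evaluation protocol (four MuJoCo tasks from OpenAI Gym, 7 randomly seeded training runs per task) could be scripted roughly as below; the "-v2" environment version suffixes and the train() placeholder are assumptions, not details from the paper.

```python
# Sketch of the evaluation protocol: four OpenAI Gym / MuJoCo tasks, each run
# with 7 random seeds. The "-v2" suffixes and the train() stub are assumptions.
import gym

TASKS = ["Ant-v2", "Hopper-v2", "HalfCheetah-v2", "Walker2d-v2"]
NUM_SEEDS = 7

def train(env, seed):
    """Placeholder for a full DSPG training run (not reproduced here)."""
    raise NotImplementedError

for task in TASKS:
    for seed in range(NUM_SEEDS):
        env = gym.make(task)
        env.seed(seed)       # older Gym API; newer versions use env.reset(seed=seed)
        # train(env, seed)   # one of the 7 seeded training runs for this task
```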
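
For quick reference, the quoted Experiment Setup values can be gathered into a single configuration; the key names below are my own, while the values are those stated above.

```python
# Hyperparameters quoted in the Experiment Setup row, gathered in one place.
# Key names are my own; the values are as reported in the quoted text.
DSPG_CONFIG = {
    "optimizer": "Adam",
    "actor_lr": 5e-5,
    "critic_lr": 5e-4,
    "discount_gamma": 0.99,
    "soft_target_update": 0.01,      # called alpha in the quoted text
    "hidden_layers": (512, 512),     # fully connected, ReLU activations
    "mean_activation": "identity",
    "std_activation": "sigmoid",
    "replay_buffer_size": 3_000_000,
    "minibatch_size": 100,
    "max_env_steps": 3_000_000,      # 3 x 10^6 global time steps
    "reward_scale": 5.0,             # applied to all four tasks
}
```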