Reinforcement Learning with Deep Energy-Based Policies

Authors: Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, Sergey Levine

ICML 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments aim to answer the following questions: (1) Does our soft Q-learning method accurately capture a multi-modal policy distribution? (2) Can soft Q-learning with energy-based policies aid exploration for complex tasks that require tracking multiple modes? (3) Can a maximum entropy policy serve as a good initialization for finetuning on different tasks, when compared to pretraining with a standard deterministic objective? We compare our algorithm to DDPG (Lillicrap et al., 2015), which has been shown to achieve better sample efficiency on the continuous control problems that we consider than other recent techniques such as REINFORCE (Williams, 1992), TRPO (Schulman et al., 2015a), and A3C (Mnih et al., 2016). This comparison is particularly interesting since, as discussed in Section 4, DDPG closely corresponds to a deterministic maximum a posteriori variant of our method. The detailed experimental setup can be found in Appendix D. Videos of all experiments and example source code are available online.
Researcher Affiliation | Collaboration | (1) UC Berkeley, Department of Electrical Engineering and Computer Sciences; (2) UC Berkeley, Department of Mathematics; (3) OpenAI; (4) International Computer Science Institute. Correspondence to: Haoran Tang <hrtang@math.berkeley.edu>, Tuomas Haarnoja <haarnoja@berkeley.edu>.
Pseudocode | Yes | Algorithm 1: Soft Q-learning. A minimal sketch of the soft Bellman backup underlying the algorithm appears after the table.
Open Source Code | Yes | "Videos of all experiments and example source code are available online." Code: https://github.com/haarnoja/softqlearning
Open Datasets | No | The paper describes simulated environments (e.g., "simulated experiments with swimming and walking robots", "2D multi-goal environment", "simulated swimming snake", "quadrupedal 3D robot") that are generated for the experiments, but it does not provide concrete access information or citations for a publicly available, pre-existing dataset.
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, or testing.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions software components such as the "ADAM (Kingma & Ba, 2015) optimizer" but does not provide specific version numbers for any ancillary software or libraries needed to replicate the experiments.
Experiment Setup | Yes | The detailed experimental setup can be found in Appendix D. Appendix D.1, "Network Architectures and Training Details", gives specifics such as two hidden layers of 64 units with ReLU activations, a learning rate of 3e-4, a batch size of 256, a target network update every 1000 steps, and a replay buffer of size 10^6; these values are collected into a configuration sketch after the table.
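
As a reading aid for the Pseudocode row, here is a minimal sketch of the soft Bellman backup that Algorithm 1 (soft Q-learning) builds on, with the soft value estimated from uniformly sampled actions. The function and variable names (soft_value, soft_q_target, q_next, action_dim) are illustrative and are not taken from the authors' released code.

```python
import numpy as np

def soft_value(q_next, alpha=1.0, action_dim=1):
    """Monte Carlo estimate of the soft value V(s') = alpha * log E[exp(Q(s', a') / alpha)].

    q_next: Q(s', a_i) evaluated at actions a_i drawn uniformly from [-1, 1]^action_dim.
    The uniform proposal has density 2**(-action_dim), so the importance-weighted
    estimate is a log-mean-exp plus the constant action_dim * log(2).
    """
    q_next = np.asarray(q_next, dtype=np.float64)
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = q_next.max()
    log_mean_exp = q_max / alpha + np.log(np.mean(np.exp((q_next - q_max) / alpha)))
    return alpha * (log_mean_exp + action_dim * np.log(2.0))

def soft_q_target(reward, q_next, gamma=0.99, alpha=1.0, done=False, action_dim=1):
    """One-step regression target r + gamma * V_soft(s') for the soft Q-function."""
    bootstrap = 0.0 if done else gamma * soft_value(q_next, alpha, action_dim)
    return reward + bootstrap
```

For example, soft_q_target(reward=1.0, q_next=np.random.randn(32), action_dim=6) would give the target for a 6-dimensional action space with 32 sampled next actions.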
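
For the Experiment Setup row, the hyperparameters quoted from Appendix D.1 can be restated as a single configuration block. This is only a summary of the values listed above; the dictionary key names are illustrative, not the authors' actual configuration schema.

```python
# Hyperparameters quoted in the Experiment Setup row (Appendix D.1 of the paper);
# key names are illustrative, not the authors' configuration schema.
soft_q_learning_config = {
    "hidden_layer_sizes": (64, 64),   # two hidden layers, 64 units each
    "activation": "relu",             # ReLU activations
    "optimizer": "adam",              # ADAM (Kingma & Ba, 2015)
    "learning_rate": 3e-4,
    "batch_size": 256,
    "target_update_interval": 1000,   # target network update every 1000 steps
    "replay_buffer_size": int(1e6),
}
```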