Is Bang-Bang Control All You Need? Solving Continuous Control with Bernoulli Policies

Authors: Tim Seyde, Igor Gilitschenski, Wilko Schwarting, Bartolomeo Stellato, Martin Riedmiller, Markus Wulfmeier, Daniela Rus

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we address these questions in two ways. First, we provide a theoretical intuition for bang-bang behavior in reinforcement learning. This is based on drawing connections to minimum-time problems where bang-bang control is often provably optimal. Second, we perform a set of experiments which optimize controllers via on-policy and off-policy learning as well as model-free and model-based state-of-the-art RL methods. Therein we compare the original algorithms with a slight modification where the Gaussian policy head is replaced with the Bernoulli distribution, resulting in a bang-bang controller. (A minimal sketch of this policy-head substitution appears after this table.)
Researcher Affiliation | Collaboration | Tim Seyde (MIT CSAIL), Igor Gilitschenski (University of Toronto), Wilko Schwarting (MIT CSAIL), Bartolomeo Stellato (Princeton University), Martin Riedmiller (DeepMind), Markus Wulfmeier (DeepMind), Daniela Rus (MIT CSAIL)
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | Please find videos and additional details at https://sites.google.com/view/bang-bang-rl. This link provides videos and additional details; it is not stated to provide the source code for the methodology.
Open Datasets | Yes | We investigate performance of Bang-Bang policies on several continuous control problems from the DeepMind Control Suite. Figure 2 provides learning curves for PPO, SAC, MPO, and DreamerV2. [49] Y. Tassa, S. Tunyasuvunakool, A. Muldal, Y. Doron, S. Liu, S. Bohez, J. Merel, T. Erez, T. Lillicrap, and N. Heess. dm_control: Software and tasks for continuous control, 2020. (A minimal task-loading sketch appears after this table.)
Dataset Splits | No | The paper does not explicitly provide details about training/validation/test splits, such as percentages or specific sample counts for each split. It uses tasks from the DeepMind Control Suite but does not specify how data was split for these experiments.
Hardware Specification | No | The authors further would like to thank Lucas Liebenwein for assistance with cluster deployment, and acknowledge the MIT SuperCloud and Lincoln Laboratory Supercomputing Center for providing HPC resources. This statement is too general and does not provide specific hardware details such as CPU/GPU models or memory amounts.
Software Dependencies | No | The paper mentions several algorithms and frameworks (PPO, SAC, MPO, DreamerV2, DeepMind Control Suite, Acme, Tonic) but does not provide specific version numbers for these software components.
Experiment Setup | No | The paper describes the algorithms used and the environments, but it does not provide specific experimental setup details such as hyperparameter values (e.g., learning rates, batch sizes, epochs) or detailed training configurations.
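
The Research Type row quotes the paper's core modification: replacing a Gaussian policy head with a Bernoulli one while leaving the rest of the agent unchanged. The sketch below illustrates what such a swap typically looks like in PyTorch; the class names, layer sizes, and the to_bang_bang helper are assumptions for illustration, not the authors' released code.

```python
# Minimal sketch (PyTorch) of swapping a Gaussian policy head for a Bernoulli one.
# All names here are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn
from torch.distributions import Normal, Bernoulli, Independent

class GaussianPolicyHead(nn.Module):
    """Standard continuous-control head: diagonal Gaussian over action dimensions."""
    def __init__(self, hidden_dim: int, action_dim: int):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, h: torch.Tensor):
        # sample() yields real-valued actions (typically squashed/clipped downstream)
        return Independent(Normal(self.mean(h), self.log_std.exp()), 1)

class BernoulliPolicyHead(nn.Module):
    """Bang-bang head: one Bernoulli per action dimension."""
    def __init__(self, hidden_dim: int, action_dim: int):
        super().__init__()
        self.logits = nn.Linear(hidden_dim, action_dim)

    def forward(self, h: torch.Tensor):
        # sample() yields values in {0, 1}^action_dim
        return Independent(Bernoulli(logits=self.logits(h)), 1)

def to_bang_bang(binary_action: torch.Tensor, a_min: float = -1.0, a_max: float = 1.0):
    """Map Bernoulli samples in {0, 1} to the extremes of the action interval."""
    return a_min + (a_max - a_min) * binary_action
```

In this sketch only the action distribution changes; the critic, optimizer, and the rest of the training loop stay as in the original algorithm, which mirrors the comparison the paper describes.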
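
Since the Open Datasets row cites the DeepMind Control Suite, the following is a minimal sketch of loading one of its tasks with the dm_control package and applying bang-bang actions built from the environment's action bounds. The choice of the cartpole swingup task and the random binary switching are illustrative assumptions, not the paper's experimental protocol.

```python
# Minimal sketch, assuming the dm_control package is installed:
# load a Control Suite task and step it with bang-bang actions.
import numpy as np
from dm_control import suite

env = suite.load(domain_name="cartpole", task_name="swingup")
spec = env.action_spec()  # BoundedArray exposing .minimum, .maximum, .shape

time_step = env.reset()
while not time_step.last():
    # Sample a binary switch per action dimension and map it to the bounds.
    switch = np.random.randint(0, 2, size=spec.shape)
    action = np.where(switch == 1, spec.maximum, spec.minimum)
    time_step = env.step(action)
```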