Control Regularization for Reduced Variance Reinforcement Learning
Authors: Richard Cheng, Abhinav Verma, Gabor Orosz, Swarat Chaudhuri, Yisong Yue, Joel Burdick
ICML 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our approach empirically on a range of settings, and demonstrate significantly reduced variance, guaranteed dynamic stability, and more efficient learning than deep RL alone. |
| Researcher Affiliation | Academia | California Institute of Technology, Pasadena, CA; Rice University, Houston, TX; University of Michigan, Ann Arbor, MI. |
| Pseudocode | Yes | Algorithm 1 Control Regularized RL (CORE-RL) |
| Open Source Code | Yes | All code can be found at https://github.com/rcheng805/CORE-RL. |
| Open Datasets | Yes | We apply the CORE-RL algorithm to control of the cartpole from the OpenAI Gym environment (CartPole-v1). ... The experimental setup and data collection process are described in (Ge et al., 2018). |
| Dataset Splits | No | The paper describes running experiments multiple times with different random seeds and splitting data into episodes, but does not provide specific train/validation/test dataset splits (e.g., percentages, sample counts, or explicit splits for the environments). |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components such as DDPG, PPO, TRPO, OpenAI Gym, and TORCS, but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For all three problems, we use DDPG as the policy gradient RL algorithm (Lillicrap et al., 2016). We use a neural network with 2 hidden layers of 64 neurons each. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001. We use a batch size of 64 and a discount factor of 0.99. We use a replay buffer of size 10^6. We found that the Adaptive Mixing Strategy performs best when λ_max = 50 and C = 0.0005. (These values are restated in the configuration sketch below the table.) |
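
The Pseudocode row above refers to Algorithm 1 (CORE-RL), whose central step regularizes the learned policy by blending its action with a control prior through the mixing weight λ. The snippet below is a hedged illustration of that mixing rule only, not the authors' released implementation: `rl_policy` and `control_prior` are hypothetical stand-ins, and the fixed λ shown here replaces the paper's adaptive strategy, which adjusts λ (up to λ_max) based on the TD error.

```python
import numpy as np

def mix_actions(u_rl, u_prior, lam):
    """CORE-RL regularized action: a convex combination of the RL policy's
    action and the control prior's action. Larger lambda puts more weight
    on the prior; lambda = 0 recovers the pure RL policy."""
    return (u_rl + lam * u_prior) / (1.0 + lam)

# Hypothetical stand-ins for a learned policy and a control prior
# (e.g., an LQR- or PID-style controller built from an approximate model).
def rl_policy(state):
    return np.tanh(np.random.randn(1))          # placeholder learned action

def control_prior(state):
    K = np.array([[-1.0, -2.0, -20.0, -5.0]])   # placeholder linear feedback gain
    return K @ state

state = np.zeros(4)
lam = 5.0   # fixed mixing weight; the paper's adaptive variant instead raises
            # lambda toward lambda_max when the TD error signals an uncertain policy
action = mix_actions(rl_policy(state), control_prior(state), lam)
```

The Experiment Setup row lists the reported hyperparameters. The dictionary below simply restates them for quick reference; the variable name and keys are illustrative and do not come from the released code at https://github.com/rcheng805/CORE-RL.

```python
# Hyperparameters as reported in the paper (key names are illustrative).
CORE_RL_DDPG_CONFIG = {
    "rl_algorithm": "DDPG",        # policy gradient learner (Lillicrap et al., 2016)
    "hidden_layers": [64, 64],     # 2 hidden layers, 64 neurons each
    "optimizer": "Adam",           # Kingma & Ba, 2014
    "learning_rate": 1e-3,
    "batch_size": 64,
    "discount_factor": 0.99,
    "replay_buffer_size": 10**6,
    # Adaptive Mixing Strategy (weighting between RL policy and control prior)
    "lambda_max": 50,
    "C": 5e-4,
}
```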