Control Regularization for Reduced Variance Reinforcement Learning

Authors: Richard Cheng, Abhinav Verma, Gabor Orosz, Swarat Chaudhuri, Yisong Yue, Joel Burdick

ICML 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our approach empirically on a range of settings, and demonstrate significantly reduced variance, guaranteed dynamic stability, and more efficient learning than deep RL alone.
Researcher Affiliation | Academia | California Institute of Technology, Pasadena, CA; Rice University, Houston, TX; University of Michigan, Ann Arbor, MI.
Pseudocode | Yes | Algorithm 1: Control Regularized RL (CORE-RL). (See the mixing-step sketch after this table.)
Open Source Code | Yes | All code can be found at https://github.com/rcheng805/CORE-RL.
Open Datasets | Yes | We apply the CORE-RL algorithm to control of the cartpole from the OpenAI Gym environment (CartPole-v1). ... The experimental setup and data collection process are described in (Ge et al., 2018).
Dataset Splits | No | The paper describes running experiments multiple times with different random seeds and splitting data into episodes, but does not provide specific train/validation/test dataset splits (e.g., percentages, sample counts, or explicit splits for the environments).
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies | No | The paper mentions software components such as DDPG, PPO, TRPO, OpenAI Gym, and TORCS, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | For all three problems, we use DDPG as the policy gradient RL algorithm (Lillicrap et al., 2016). We use a neural network with 2 hidden layers of 64 neurons each. We use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001. We use a batch size of 64, and a discount factor of 0.99. We use a replay buffer of size 10^6. We found that the Adaptive Mixing Strategy performs best when λ_max = 50 and C = 0.0005. (See the configuration sketch after this table.)
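To make the CORE-RL pseudocode referenced above more concrete, here is a minimal sketch of the core mixing step: the executed action is a weighted combination of the learned RL action and a fixed control prior's action, with weight λ. The function names (`core_rl_action`, `adaptive_lambda`) and the exact adaptive-λ rule are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np


def core_rl_action(state, rl_policy, control_prior, lam):
    """Regularize the learned action toward a control prior:

        u = (1 / (1 + lam)) * u_rl + (lam / (1 + lam)) * u_prior

    Larger lam leans more heavily on the prior (lower variance);
    smaller lam trusts the learned policy more.
    """
    u_rl = rl_policy(state)         # action from the deep RL policy (e.g., a DDPG actor)
    u_prior = control_prior(state)  # action from the fixed control prior (e.g., an LQR/H-infinity controller)
    return (u_rl + lam * u_prior) / (1.0 + lam)


def adaptive_lambda(td_error, lam_max=50.0, c=5e-4):
    """Illustrative adaptive mixing weight (an assumption, not the paper's exact rule):
    grow lam with a TD-error-based uncertainty measure, capped at lam_max, so the
    agent relies on the prior more where its value estimates are uncertain."""
    return lam_max * (1.0 - np.exp(-c * abs(td_error)))
```

With a fixed λ the regularization strength is constant; the adaptive strategy instead varies λ per state, using the reported λ_max = 50 and C = 0.0005 as its scale parameters.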
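The experiment-setup row lists the DDPG hyperparameters reported in the paper. Below is a hedged configuration sketch showing how those values could be wired into a standard actor network; the class name, activations, output squashing, and example dimensions are assumptions and do not correspond to the released CORE-RL code.

```python
import torch
import torch.nn as nn

# Hyperparameter values reported in the paper's experiment setup.
HIDDEN_SIZES = (64, 64)     # 2 hidden layers, 64 neurons each
LEARNING_RATE = 1e-3        # Adam optimizer
BATCH_SIZE = 64
DISCOUNT_FACTOR = 0.99
REPLAY_BUFFER_SIZE = 10**6
LAMBDA_MAX = 50.0           # cap for the adaptive mixing weight
C = 5e-4                    # scale constant for the adaptive mixing strategy


class Actor(nn.Module):
    """Minimal DDPG-style actor matching the reported architecture
    (two hidden layers of 64 units); the ReLU activations and tanh
    output squashing are assumptions."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, HIDDEN_SIZES[0]), nn.ReLU(),
            nn.Linear(HIDDEN_SIZES[0], HIDDEN_SIZES[1]), nn.ReLU(),
            nn.Linear(HIDDEN_SIZES[1], action_dim), nn.Tanh(),
        )

    def forward(self, state):
        return self.net(state)


# Example instantiation with CartPole-like dimensions (4 state variables, 1 continuous action).
actor = Actor(state_dim=4, action_dim=1)
optimizer = torch.optim.Adam(actor.parameters(), lr=LEARNING_RATE)
```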