Divide-and-Conquer Reinforcement Learning

Authors: Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, Sergey Levine

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that divide-and-conquer RL greatly outperforms conventional policy gradient methods on challenging grasping, manipulation, and locomotion tasks, and exceeds the performance of a variety of prior methods. Videos of policies learned by our algorithm can be viewed at https://sites.google.com/view/dnc-rl/.
Researcher Affiliation | Academia | Dibya Ghosh (1), Avi Singh (1), Aravind Rajeswaran (2), Vikash Kumar (2), Sergey Levine (1); (1) University of California, Berkeley; (2) University of Washington, Seattle
Pseudocode | Yes | The algorithm is laid out fully in pseudocode below.

    Require: R, the distillation period
    function DnC()
        Sample initial states s0 from the task
        Produce contexts ω1, ω2, ..., ωn by clustering the initial states s0
        Randomly initialize the central policy πc
        for t = 1, 2, ... until convergence do
            Set πi = πc for all i = 1 ... n
            for R iterations do
                Collect trajectories Ti in context ωi using policy πi, for all i = 1 ... n
                for all local policies πi do
                    Take a gradient step on the surrogate loss L w.r.t. πi
            Minimize Lcenter w.r.t. πc using the previously sampled states (Ti), i = 1 ... n
        return πc
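As an illustration only, the following is a minimal Python sketch of the outer DnC loop in the pseudocode above, under stated assumptions: the Policy class, the toy point-mass dynamics in collect_trajectory, and the local_update and distill stand-ins are all hypothetical simplifications. The paper itself trains the local policies with TRPO on a KL-penalized surrogate loss and distills them into the central policy by minimizing a KL-divergence objective; neither is reproduced here.

    # Structural sketch of DnC (Algorithm 1) with toy stand-ins for the RL machinery.
    import numpy as np
    from sklearn.cluster import KMeans


    class Policy:
        """Toy linear-Gaussian policy standing in for a TRPO-trained network."""

        def __init__(self, obs_dim, act_dim, rng):
            self.W = 0.01 * rng.standard_normal((act_dim, obs_dim))

        def act(self, obs, rng):
            return self.W @ obs + 0.1 * rng.standard_normal(self.W.shape[0])

        def copy_from(self, other):
            self.W = other.W.copy()


    def collect_trajectory(policy, s0, horizon, rng):
        # Hypothetical point-mass dynamics: each action nudges the state.
        states, actions = [s0], []
        for _ in range(horizon):
            a = policy.act(states[-1], rng)
            states.append(states[-1] + 0.1 * a)
            actions.append(a)
        return np.array(states), np.array(actions)


    def local_update(policy, trajectories, rng):
        # Stand-in for a TRPO step on the local surrogate loss
        # (in the paper this loss also carries an alpha-weighted KL penalty).
        policy.W += 0.001 * rng.standard_normal(policy.W.shape)


    def distill(central, locals_, trajectories):
        # Stand-in for minimizing L_center w.r.t. pi_c: a naive weight average here;
        # the paper instead minimizes a KL divergence to the local policies on
        # their own sampled states.
        central.W = np.mean([p.W for p in locals_], axis=0)


    def dnc(initial_states, n_contexts=4, distill_period=10, n_outer=5,
            horizon=50, seed=0):
        rng = np.random.default_rng(seed)
        obs_dim, act_dim = initial_states.shape[1], 2

        # Produce contexts omega_1..omega_n by clustering sampled initial states s0.
        labels = KMeans(n_clusters=n_contexts, n_init=10,
                        random_state=seed).fit_predict(initial_states)
        contexts = [initial_states[labels == i] for i in range(n_contexts)]

        central = Policy(obs_dim, act_dim, rng)
        locals_ = [Policy(obs_dim, act_dim, rng) for _ in range(n_contexts)]

        for _ in range(n_outer):                      # outer loop, "until convergence"
            for p in locals_:                         # reset pi_i = pi_c
                p.copy_from(central)
            all_trajs = [[] for _ in range(n_contexts)]
            for _ in range(distill_period):           # R local iterations
                for i, p in enumerate(locals_):
                    s0 = contexts[i][rng.integers(len(contexts[i]))]
                    all_trajs[i].append(collect_trajectory(p, s0, horizon, rng))
                    local_update(p, all_trajs[i], rng)
            distill(central, locals_, all_trajs)      # minimize L_center w.r.t. pi_c
        return central


    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        s0 = rng.standard_normal((200, 2))            # sampled initial states
        dnc(s0)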
Open Source Code | No | The paper provides a link to videos of the learned policies but not to source code for the described method. "Videos of policies learned by our algorithm can be viewed at https://sites.google.com/view/dnc-rl/."
Open Datasets | No | The paper describes custom-designed simulated environments (a Kinova Jaco arm and MuJoCo simulations for the Picking, Lobbing, Catching, Ant Position, and Stairs tasks) rather than a publicly available dataset with provided access information. "All of our environments are designed and simulated in MuJoCo (Todorov et al., 2012)."
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits; reinforcement learning typically learns through interaction with an environment rather than from predefined splits.
Hardware Specification | No | The paper describes simulating a robotic arm ("we simulate the Kinova Jaco") but does not specify the hardware (e.g., CPU or GPU models, memory) used to run the simulations or train the models.
Software Dependencies | No | The paper mentions software components such as TRPO, MuJoCo, and k-means clustering but does not provide version numbers for these dependencies. "All of our environments are designed and simulated in MuJoCo (Todorov et al., 2012)."
Experiment Setup | Yes | The primary hyperparameters of concern are the TRPO learning rate δ_KL and the penalty coefficient α. The TRPO learning rate is global to the task; for each task, to find an appropriate learning rate, we ran TRPO with five learning rates {0.0025, 0.005, 0.01, 0.02, 0.04}. The setting with the highest final reward was selected for each algorithm on each task. Furthermore, because the performance of policy gradient methods like TRPO varies significantly from run to run, we run each experiment with 5 random seeds, reporting mean statistics and standard deviations.
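For concreteness, here is a hedged sketch of that sweep-and-average protocol: five TRPO step sizes, five random seeds each, mean and standard deviation of the final reward, and selection of the best-performing setting per task. The run_trpo function is a hypothetical placeholder for a full training run, not the authors' code.

    # Illustrative hyperparameter sweep and seed averaging, assuming a placeholder trainer.
    import numpy as np

    LEARNING_RATES = [0.0025, 0.005, 0.01, 0.02, 0.04]   # KL step sizes listed above
    N_SEEDS = 5


    def run_trpo(task, kl_step, seed):
        """Hypothetical stand-in: returns the final average reward of one training run."""
        rng = np.random.default_rng(seed)
        return rng.normal(loc=100 * kl_step, scale=10)    # dummy value for illustration


    def sweep(task):
        results = {}
        for kl_step in LEARNING_RATES:
            finals = [run_trpo(task, kl_step, seed) for seed in range(N_SEEDS)]
            results[kl_step] = (np.mean(finals), np.std(finals))
        # Select the step size with the highest mean final reward for this task.
        best = max(results, key=lambda k: results[k][0])
        return best, results


    if __name__ == "__main__":
        best, results = sweep("Picking")
        for kl_step, (mean, std) in results.items():
            print(f"D_KL={kl_step}: {mean:.1f} +/- {std:.1f}")
        print("selected step size:", best)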