Divide-and-Conquer Reinforcement Learning

Authors: Dibya Ghosh, Avi Singh, Aravind Rajeswaran, Vikash Kumar, Sergey Levine

ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that divide-and-conquer RL greatly outperforms conventional policy gradient methods on challenging grasping, manipulation, and locomotion tasks, and exceeds the performance of a variety of prior methods. Videos of policies learned by our algorithm can be viewed at https://sites.google.com/view/dnc-rl/.
Researcher Affiliation | Academia | Dibya Ghosh (1), Avi Singh (1), Aravind Rajeswaran (2), Vikash Kumar (2), Sergey Levine (1); (1) University of California, Berkeley; (2) University of Washington, Seattle
Pseudocode | Yes | The algorithm is laid out fully in pseudocode below.

    Require: R, the distillation period
    function DnC()
        Sample initial states s0 from the task
        Produce contexts ω1, ω2, ..., ωn by clustering the initial states s0
        Randomly initialize the central policy πc
        for t = 1, 2, ... until convergence do
            Set πi = πc for all i = 1 ... n
            for R iterations do
                Collect trajectories Ti in context ωi using policy πi, for all i = 1 ... n
                for all local policies πi do
                    Take a gradient step on the surrogate loss L w.r.t. πi
            Minimize Lcenter w.r.t. πc using the previously sampled states (Ti), i = 1 ... n
        return πc
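As an illustration only, the following is a minimal Python sketch of the outer DnC loop in the pseudocode above, under stated assumptions: the Policy class, the toy point-mass dynamics in collect_trajectory, and the local_update and distill stand-ins are all hypothetical simplifications. The paper itself trains the local policies with TRPO on a KL-penalized surrogate loss and distills them into the central policy by minimizing a KL-divergence objective; neither is reproduced here.

    # Structural sketch of DnC (Algorithm 1) with toy stand-ins for the RL machinery.
    import numpy as np
    from sklearn.cluster import KMeans


    class Policy:
        """Toy linear-Gaussian policy standing in for a TRPO-trained network."""

        def __init__(self, obs_dim, act_dim, rng):
            self.W = 0.01 * rng.standard_normal((act_dim, obs_dim))

        def act(self, obs, rng):
            return self.W @ obs + 0.1 * rng.standard_normal(self.W.shape[0])

        def copy_from(self, other):
            self.W = other.W.copy()


    def collect_trajectory(policy, s0, horizon, rng):
        # Hypothetical point-mass dynamics: each action nudges the state.
        states, actions = [s0], []
        for _ in range(horizon):
            a = policy.act(states[-1], rng)
            states.append(states[-1] + 0.1 * a)
            actions.append(a)
        return np.array(states), np.array(actions)


    def local_update(policy, trajectories, rng):
        # Stand-in for a TRPO step on the local surrogate loss
        # (in the paper this loss also carries an alpha-weighted KL penalty).
        policy.W += 0.001 * rng.standard_normal(policy.W.shape)


    def distill(central, locals_, trajectories):
        # Stand-in for minimizing L_center w.r.t. pi_c: a naive weight average here;
        # the paper instead minimizes a KL divergence to the local policies on
        # their own sampled states.
        central.W = np.mean([p.W for p in locals_], axis=0)


    def dnc(initial_states, n_contexts=4, distill_period=10, n_outer=5,
            horizon=50, seed=0):
        rng = np.random.default_rng(seed)
        obs_dim, act_dim = initial_states.shape[1], 2

        # Produce contexts omega_1..omega_n by clustering sampled initial states s0.
        labels = KMeans(n_clusters=n_contexts, n_init=10,
                        random_state=seed).fit_predict(initial_states)
        contexts = [initial_states[labels == i] for i in range(n_contexts)]

        central = Policy(obs_dim, act_dim, rng)
        locals_ = [Policy(obs_dim, act_dim, rng) for _ in range(n_contexts)]

        for _ in range(n_outer):                      # outer loop, "until convergence"
            for p in locals_:                         # reset pi_i = pi_c
                p.copy_from(central)
            all_trajs = [[] for _ in range(n_contexts)]
            for _ in range(distill_period):           # R local iterations
                for i, p in enumerate(locals_):
                    s0 = contexts[i][rng.integers(len(contexts[i]))]
                    all_trajs[i].append(collect_trajectory(p, s0, horizon, rng))
                    local_update(p, all_trajs[i], rng)
            distill(central, locals_, all_trajs)      # minimize L_center w.r.t. pi_c
        return central


    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        s0 = rng.standard_normal((200, 2))            # sampled initial states
        dnc(s0)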
Open Source Code | No | The paper provides a link to videos of the learned policies but not to source code for the described method. "Videos of policies learned by our algorithm can be viewed at https://sites.google.com/view/dnc-rl/."
Open Datasets | No | The paper describes custom-designed simulated environments (a Kinova Jaco arm and MuJoCo simulations for the Picking, Lobbing, Catching, Ant Position, and Stairs tasks) rather than a publicly available dataset with provided access information. "All of our environments are designed and simulated in MuJoCo (Todorov et al., 2012)."
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits; reinforcement learning typically learns through interaction with an environment rather than from predefined splits.
Hardware Specification | No | The paper describes simulating a robotic arm ("we simulate the Kinova Jaco") but does not specify the hardware (e.g., CPU or GPU models, memory) used to run the simulations or train the models.
Software Dependencies | No | The paper mentions software components such as TRPO, MuJoCo, and k-means clustering but does not provide version numbers for these dependencies. "All of our environments are designed and simulated in MuJoCo (Todorov et al., 2012)."
Experiment Setup | Yes | The primary hyperparameters of concern are the TRPO learning rate δ_KL and the penalty coefficient α. The TRPO learning rate is global to the task; for each task, to find an appropriate learning rate, we ran TRPO with five learning rates {0.0025, 0.005, 0.01, 0.02, 0.04}. The setting with the highest final reward was selected for each algorithm on each task. Furthermore, because the performance of policy gradient methods like TRPO varies significantly from run to run, we run each experiment with 5 random seeds, reporting mean statistics and standard deviations.
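For concreteness, here is a hedged sketch of that sweep-and-average protocol: five TRPO step sizes, five random seeds each, mean and standard deviation of the final reward, and selection of the best-performing setting per task. The run_trpo function is a hypothetical placeholder for a full training run, not the authors' code.

    # Illustrative hyperparameter sweep and seed averaging, assuming a placeholder trainer.
    import numpy as np

    LEARNING_RATES = [0.0025, 0.005, 0.01, 0.02, 0.04]   # KL step sizes listed above
    N_SEEDS = 5


    def run_trpo(task, kl_step, seed):
        """Hypothetical stand-in: returns the final average reward of one training run."""
        rng = np.random.default_rng(seed)
        return rng.normal(loc=100 * kl_step, scale=10)    # dummy value for illustration


    def sweep(task):
        results = {}
        for kl_step in LEARNING_RATES:
            finals = [run_trpo(task, kl_step, seed) for seed in range(N_SEEDS)]
            results[kl_step] = (np.mean(finals), np.std(finals))
        # Select the step size with the highest mean final reward for this task.
        best = max(results, key=lambda k: results[k][0])
        return best, results


    if __name__ == "__main__":
        best, results = sweep("Picking")
        for kl_step, (mean, std) in results.items():
            print(f"D_KL={kl_step}: {mean:.1f} +/- {std:.1f}")
        print("selected step size:", best)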