Model Learning for Look-Ahead Exploration in Continuous Control

Authors: Arpit Agarwal, Katharina Muelling, Katerina Fragkiadaki

AAAI 2019, pp. 3151-3158

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that the proposed exploration strategy results in effective learning of complex manipulation policies faster than current state-of-the-art RL methods, and converges to better policies than methods that use options or parametrized skills as building blocks of the policy itself, as opposed to guiding exploration.
Researcher Affiliation | Academia | Arpit Agarwal, Katharina Muelling, Katerina Fragkiadaki, Carnegie Mellon University, United States; {arpita1,katharam}@andrew.cmu.edu, katef@cs.cmu.edu
Pseudocode | Yes | Our look-ahead exploration is described in Algorithm 2 and visualized in Figure 2. The complete exploration and reinforcement learning method is described in Algorithm 1. (A hedged sketch of such a look-ahead rollout appears after the table.)
Open Source Code | Yes | Our code is available at https://github.com/arpit15/skillbased-exploration-drl
Open Datasets | No | The paper uses the MuJoCo simulation environment for experiments and training. It does not mention using an existing public dataset or provide access information for any generated training data.
Dataset Splits | No | The paper states: 'For evaluation, we freeze the current policy and sample 20 random initial states and goals at each epoch (1 epoch = 16 episodes of environment interaction).' This describes an evaluation process, but it does not specify explicit training, validation, and test dataset splits with percentages or counts. (A sketch of this evaluation loop appears after the table.)
Hardware Specification | No | The paper mentions using a 'seven degree of freedom Baxter robot arm with parallel jaw grippers', but this is in the context of the simulated environment, not the hardware used to run the simulations or training. It also reports computational time for different branching factors ('agent takes 0.4 seconds per episode', '17 seconds', '71 seconds', '286 seconds') and notes that parallelization on GPU 'will render our tree search much more efficient', but it does not specify the actual hardware (CPU/GPU models, memory, etc.) used for the experiments.
Software Dependencies | No | The paper mentions the 'MuJoCo simulation environment (Todorov, Erez, and Tassa 2012)' and uses methods like 'Hindsight Experience Replay (Andrychowicz et al. 2017) (HER)' and 'off-policy deep deterministic policy gradients (DDPG) (Lillicrap et al. 2015)'. However, it does not provide specific version numbers for any software libraries, frameworks (e.g., TensorFlow, PyTorch), or the MuJoCo simulator itself.
Experiment Setup | Yes | We vary ϵ to be close to 1 in the beginning of training, and linearly decay it to 0.001. ... After unfolding the tree for a prespecified number of steps, we choose the path with the maximum total reward... With ϵ-greedy the agent takes 0.4 seconds per episode (50 steps), with branching factor (bf) equal to 5... all the reported results use bf=5. ... The collected data is used to train deep neural regressors for each skill, a three layer fully connected network that takes as input a state and a goal configuration and predicts the final state reached after skill execution, and the probability of success. ... For evaluation, we freeze the current policy and sample 20 random initial states and goals at each epoch (1 epoch = 16 episodes of environment interaction). (A sketch of the skill regressor and the ϵ schedule follows the table.)
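
The look-ahead exploration quoted in the Pseudocode row can be pictured as a short tree search over learned skill-outcome models. The sketch below is a minimal reconstruction, not the authors' Algorithm 2: the names lookahead_exploration, skill_models, and reward_fn are placeholders, and the default depth is an assumption; only the branching factor of 5 and the "choose the path with the maximum total reward" rule come from the quotes above.

```python
import random


def lookahead_exploration(state, goal, skill_models, reward_fn,
                          depth=3, branching_factor=5):
    """Minimal sketch of look-ahead exploration over learned skill models.

    skill_models: list of callables (state, goal) -> predicted final state,
                  standing in for the learned per-skill regressors.
    reward_fn:    callable (state, goal) -> scalar reward (assumed given).
    Returns the first skill on the imagined path with the highest total reward.
    """
    best_path, best_return = None, float("-inf")

    def expand(s, path, total_reward, remaining):
        nonlocal best_path, best_return
        if remaining == 0:
            if total_reward > best_return:
                best_return, best_path = total_reward, path
            return
        # Sample at most `branching_factor` skills to bound the tree width.
        for skill in random.sample(skill_models,
                                   min(branching_factor, len(skill_models))):
            s_next = skill(s, goal)          # imagined outcome of executing the skill
            r = reward_fn(s_next, goal)      # reward of the imagined state
            expand(s_next, path + [skill], total_reward + r, remaining - 1)

    expand(state, [], 0.0, depth)
    return best_path[0] if best_path else random.choice(skill_models)
```

During data collection, the agent would presumably execute the returned skill with probability ϵ and follow the current DDPG policy otherwise; that reading of the ϵ-greedy scheme is an interpretation of the quotes, not a quote itself.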
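The Experiment Setup row describes per-skill neural regressors (a three-layer fully connected network mapping a state and goal to the predicted final state and success probability) and a linear decay of ϵ from roughly 1 to 0.001. The sketch below is one possible reading under stated assumptions: PyTorch as the framework, a hidden width of 256, and interpreting "three layer" as two hidden layers plus output heads are all choices the paper does not specify.

```python
import torch
import torch.nn as nn


class SkillOutcomeRegressor(nn.Module):
    """Sketch of a per-skill regressor: takes a state and a goal and predicts
    the final state after skill execution plus the probability of success.
    Hidden width and framework are assumptions, not taken from the paper."""

    def __init__(self, state_dim, goal_dim, hidden_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.final_state_head = nn.Linear(hidden_dim, state_dim)  # predicted end state
        self.success_head = nn.Linear(hidden_dim, 1)              # success logit

    def forward(self, state, goal):
        h = self.backbone(torch.cat([state, goal], dim=-1))
        return self.final_state_head(h), torch.sigmoid(self.success_head(h))


def epsilon_schedule(epoch, total_epochs, eps_start=1.0, eps_end=0.001):
    """Linear decay of the look-ahead exploration probability from ~1 to 0.001."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

Training such regressors with a mean-squared error on the predicted final state and a binary cross-entropy on the success probability would be the natural choice, though the paper's exact losses are not quoted here.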
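The Dataset Splits and Experiment Setup rows both quote the evaluation protocol: the current policy is frozen and 20 random initial states and goals are sampled at each epoch. Below is a minimal sketch of such a loop, assuming a goal-conditioned gym-style environment (observation dictionary with observation and desired_goal keys, and an is_success flag in info) and 50-step episodes; the interface and horizon are assumptions drawn from the quotes, not from the released code.

```python
def evaluate_policy(env, policy, episodes=20, horizon=50):
    """Sketch of the quoted evaluation protocol: freeze the current policy,
    sample random initial states and goals, and report the success rate."""
    successes = 0
    for _ in range(episodes):
        obs = env.reset()                      # random initial state and goal
        for _ in range(horizon):
            action = policy(obs["observation"], obs["desired_goal"])
            obs, reward, done, info = env.step(action)
            if info.get("is_success", False):  # goal reached within the episode
                successes += 1
                break
    return successes / episodes
```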