Model Learning for Look-Ahead Exploration in Continuous Control

Authors: Arpit Agarwal, Katharina Muelling, Katerina Fragkiadaki

AAAI 2019, pp. 3151-3158

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that the proposed exploration strategy results in effective learning of complex manipulation policies faster than current state-of-the-art RL methods, and converges to better policies than methods that use options or parametrized skills as building blocks of the policy itself, as opposed to guiding exploration.
Researcher Affiliation | Academia | Arpit Agarwal, Katharina Muelling, Katerina Fragkiadaki, Carnegie Mellon University, United States; {arpita1,katharam}@andrew.cmu.edu, katef@cs.cmu.edu
Pseudocode | Yes | Our look-ahead exploration is described in Algorithm 2 and visualized in Figure 2. The complete exploration and reinforcement learning method is described in Algorithm 1. (A hedged sketch of such a look-ahead rollout appears after the table.)
Open Source Code | Yes | Our code is available at https://github.com/arpit15/skillbased-exploration-drl
Open Datasets | No | The paper uses the MuJoCo simulation environment for experiments and training. It does not mention using an existing public dataset or provide access information for any generated training data.
Dataset Splits | No | The paper states: 'For evaluation, we freeze the current policy and sample 20 random initial states and goals at each epoch (1 epoch = 16 episodes of environment interaction).' This describes an evaluation process, but it does not specify explicit training, validation, and test dataset splits with percentages or counts. (A sketch of this evaluation loop appears after the table.)
Hardware Specification | No | The paper mentions using a 'seven degree of freedom Baxter robot arm with parallel jaw grippers', but this is in the context of the simulated environment, not the hardware used to run the simulations or training. It also reports computational time for different branching factors ('agent takes 0.4 seconds per episode', '17 seconds', '71 seconds', '286 seconds') and notes that parallelization on GPU 'will render our tree search much more efficient', but it does not specify the actual hardware (CPU/GPU models, memory, etc.) used for the experiments.
Software Dependencies | No | The paper mentions the 'MuJoCo simulation environment (Todorov, Erez, and Tassa 2012)' and uses methods like 'Hindsight Experience Replay (Andrychowicz et al. 2017) (HER)' and 'off-policy deep deterministic policy gradients (DDPG) (Lillicrap et al. 2015)'. However, it does not provide specific version numbers for any software libraries, frameworks (e.g., TensorFlow, PyTorch), or the MuJoCo simulator itself.
Experiment Setup | Yes | We vary ϵ to be close to 1 in the beginning of training, and linearly decay it to 0.001. ... After unfolding the tree for a prespecified number of steps, we choose the path with the maximum total reward... With ϵ-greedy the agent takes 0.4 seconds per episode (50 steps), with branching factor (bf) equal to 5... all the reported results use bf=5. ... The collected data is used to train deep neural regressors for each skill, a three layer fully connected network that takes as input a state and a goal configuration and predicts the final state reached after skill execution, and the probability of success. ... For evaluation, we freeze the current policy and sample 20 random initial states and goals at each epoch (1 epoch = 16 episodes of environment interaction). (A sketch of the skill regressor and the ϵ schedule follows the table.)
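
The look-ahead exploration quoted in the Pseudocode row can be pictured as a short tree search over learned skill-outcome models. The sketch below is a minimal reconstruction, not the authors' Algorithm 2: the names lookahead_exploration, skill_models, and reward_fn are placeholders, and the default depth is an assumption; only the branching factor of 5 and the "choose the path with the maximum total reward" rule come from the quotes above.

```python
import random


def lookahead_exploration(state, goal, skill_models, reward_fn,
                          depth=3, branching_factor=5):
    """Minimal sketch of look-ahead exploration over learned skill models.

    skill_models: list of callables (state, goal) -> predicted final state,
                  standing in for the learned per-skill regressors.
    reward_fn:    callable (state, goal) -> scalar reward (assumed given).
    Returns the first skill on the imagined path with the highest total reward.
    """
    best_path, best_return = None, float("-inf")

    def expand(s, path, total_reward, remaining):
        nonlocal best_path, best_return
        if remaining == 0:
            if total_reward > best_return:
                best_return, best_path = total_reward, path
            return
        # Sample at most `branching_factor` skills to bound the tree width.
        for skill in random.sample(skill_models,
                                   min(branching_factor, len(skill_models))):
            s_next = skill(s, goal)          # imagined outcome of executing the skill
            r = reward_fn(s_next, goal)      # reward of the imagined state
            expand(s_next, path + [skill], total_reward + r, remaining - 1)

    expand(state, [], 0.0, depth)
    return best_path[0] if best_path else random.choice(skill_models)
```

During data collection, the agent would presumably execute the returned skill with probability ϵ and follow the current DDPG policy otherwise; that reading of the ϵ-greedy scheme is an interpretation of the quotes, not a quote itself.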
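The Experiment Setup row describes per-skill neural regressors (a three-layer fully connected network mapping a state and goal to the predicted final state and success probability) and a linear decay of ϵ from roughly 1 to 0.001. The sketch below is one possible reading under stated assumptions: PyTorch as the framework, a hidden width of 256, and interpreting "three layer" as two hidden layers plus output heads are all choices the paper does not specify.

```python
import torch
import torch.nn as nn


class SkillOutcomeRegressor(nn.Module):
    """Sketch of a per-skill regressor: takes a state and a goal and predicts
    the final state after skill execution plus the probability of success.
    Hidden width and framework are assumptions, not taken from the paper."""

    def __init__(self, state_dim, goal_dim, hidden_dim=256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.final_state_head = nn.Linear(hidden_dim, state_dim)  # predicted end state
        self.success_head = nn.Linear(hidden_dim, 1)              # success logit

    def forward(self, state, goal):
        h = self.backbone(torch.cat([state, goal], dim=-1))
        return self.final_state_head(h), torch.sigmoid(self.success_head(h))


def epsilon_schedule(epoch, total_epochs, eps_start=1.0, eps_end=0.001):
    """Linear decay of the look-ahead exploration probability from ~1 to 0.001."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return eps_start + frac * (eps_end - eps_start)
```

Training such regressors with a mean-squared error on the predicted final state and a binary cross-entropy on the success probability would be the natural choice, though the paper's exact losses are not quoted here.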
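The Dataset Splits and Experiment Setup rows both quote the evaluation protocol: the current policy is frozen and 20 random initial states and goals are sampled at each epoch. Below is a minimal sketch of such a loop, assuming a goal-conditioned gym-style environment (observation dictionary with observation and desired_goal keys, and an is_success flag in info) and 50-step episodes; the interface and horizon are assumptions drawn from the quotes, not from the released code.

```python
def evaluate_policy(env, policy, episodes=20, horizon=50):
    """Sketch of the quoted evaluation protocol: freeze the current policy,
    sample random initial states and goals, and report the success rate."""
    successes = 0
    for _ in range(episodes):
        obs = env.reset()                      # random initial state and goal
        for _ in range(horizon):
            action = policy(obs["observation"], obs["desired_goal"])
            obs, reward, done, info = env.step(action)
            if info.get("is_success", False):  # goal reached within the episode
                successes += 1
                break
    return successes / episodes
```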