Model Learning for Look-Ahead Exploration in Continuous Control
Authors: Arpit Agarwal, Katharina Muelling, Katerina Fragkiadaki
AAAI 2019, pp. 3151-3158
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that the proposed exploration strategy results in effective learning of complex manipulation policies faster than current state-of-the-art RL methods, and converges to better policies than methods that use options or parameterized skills as building blocks of the policy itself, as opposed to guiding exploration. |
| Researcher Affiliation | Academia | Arpit Agarwal, Katharina Muelling, Katerina Fragkiadaki Carnegie Mellon University United States {arpita1,katharam}@andrew.cmu.edu, katef@cs.cmu.edu |
| Pseudocode | Yes | Our look-ahead exploration is described in Algorithm 2 and visualized in Figure 2. The complete exploration and reinforcement learning method is described in Algorithm 1. |
| Open Source Code | Yes | Our code is available at https://github.com/arpit15/skillbased-exploration-drl |
| Open Datasets | No | The paper uses the MuJoCo simulation environment for experiments and training. It does not mention using an existing public dataset or provide access information for any generated training data. |
| Dataset Splits | No | The paper states: 'For evaluation, we freeze the current policy and sample 20 random initial states and goals at each epoch (1 epoch = 16 episodes of environment interaction).' This describes an evaluation process, but it does not specify explicit training, validation, and test dataset splits with percentages or counts. |
| Hardware Specification | No | The paper mentions using a 'seven degree of freedom Baxter robot arm with parallel jaw grippers', but this is in the context of the simulated environment, not the hardware used to run the simulations or training. It also discusses computational time for different branching factors ('agent takes 0.4 seconds per episode', '17 seconds', '71 seconds', '286 seconds') and notes that parallelization on GPU 'will render our tree search much more efficient' but does not specify the actual hardware (CPU/GPU models, memory, etc.) used for the experiments. |
| Software Dependencies | No | The paper mentions the 'MuJoCo simulation environment (Todorov, Erez, and Tassa 2012)' and uses methods like 'Hindsight Experience Replay (Andrychowicz et al. 2017) (HER)' and 'off-policy deep deterministic policy gradients (DDPG) (Lillicrap et al. 2015)'. However, it does not provide specific version numbers for any software libraries, frameworks (e.g., TensorFlow, PyTorch), or the MuJoCo simulator itself. |
| Experiment Setup | Yes | We vary ϵ to be close to 1 in the beginning of training, and linearly decay it to 0.001. ... After unfolding the tree for a prespecified number of steps, we choose the path with the maximum total reward... With ϵ-greedy the agent takes 0.4 seconds per episode (50 steps), with branching factor (bf) equal to 5... all the reported results use bf=5. ... The collected data is used to train deep neural regressors for each skill, a three-layer fully connected network that takes as input a state and a goal configuration and predicts the final state reached after skill execution, and the probability of success. ... For evaluation, we freeze the current policy and sample 20 random initial states and goals at each epoch (1 epoch = 16 episodes of environment interaction). |
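
The experiment-setup evidence above describes ϵ-greedy look-ahead exploration over a tree of imagined skill outcomes (branching factor 5, with ϵ decayed linearly from near 1 to 0.001). The sketch below illustrates that control flow under stated assumptions; the `skill_models` objects, `propose_skill_goals`, `reward_fn`, and `policy` callables are hypothetical names for illustration, not the API in the authors' released code.

```python
# Minimal sketch of epsilon-greedy look-ahead exploration over learned skill models.
# Assumes each skill model exposes predict(state, skill_goal) -> (pred_state, p_success).
import random
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Node:
    state: Any            # predicted state after executing this path's skills
    total_reward: float   # accumulated predicted reward along the path
    first_action: Any     # (skill, skill_goal) taken at the root of this path


def linear_epsilon(epoch: int, total_epochs: int,
                   eps_start: float = 1.0, eps_end: float = 0.001) -> float:
    """Linearly decay the exploration rate from ~1.0 to 0.001 over training."""
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return eps_start + frac * (eps_end - eps_start)


def lookahead_action(state, goal, skill_models: Dict[str, Any],
                     reward_fn: Callable, propose_skill_goals: Callable,
                     depth: int = 3, bf: int = 5):
    """Unfold a tree of imagined skill outcomes and return the first
    (skill, skill_goal) on the path with the highest predicted total reward."""
    frontier: List[Node] = [Node(state, 0.0, None)]
    for _ in range(depth):
        next_frontier: List[Node] = []
        for node in frontier:
            # Branch on bf candidate (skill, skill-goal) pairs per node.
            for skill, skill_goal in propose_skill_goals(node.state, goal, bf):
                pred_state, p_success = skill_models[skill].predict(node.state, skill_goal)
                r = p_success * reward_fn(pred_state, goal)
                first = node.first_action if node.first_action is not None else (skill, skill_goal)
                next_frontier.append(Node(pred_state, node.total_reward + r, first))
        frontier = next_frontier
    return max(frontier, key=lambda n: n.total_reward).first_action


def explore_action(state, goal, policy: Callable, skill_models, reward_fn,
                   propose_skill_goals, epsilon: float):
    """Epsilon-greedy switch between model-based look-ahead and the current policy."""
    if random.random() < epsilon:
        return lookahead_action(state, goal, skill_models, reward_fn, propose_skill_goals)
    return policy(state, goal)
```

The exponential growth of the tree with branching factor matches the reported timings (0.4 s per episode without look-ahead versus 17 s, 71 s, and 286 s at larger branching factors), which is why the report notes bf=5 for all results.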
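
The same row quotes a three-layer fully connected regressor per skill that maps a state and goal configuration to the predicted post-skill state and a success probability. Below is a minimal PyTorch sketch consistent with that description; the hidden-layer width, losses, and training loop are assumptions, not values reported in the paper.

```python
# Sketch of a per-skill outcome regressor: (state, skill goal) -> (final state, p_success).
# Hidden width and loss weighting are illustrative assumptions.
import torch
import torch.nn as nn


class SkillOutcomeModel(nn.Module):
    def __init__(self, state_dim: int, goal_dim: int, hidden: int = 256):
        super().__init__()
        self.state_dim = state_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim + 1),  # predicted final state + success logit
        )

    def forward(self, state: torch.Tensor, goal: torch.Tensor):
        out = self.net(torch.cat([state, goal], dim=-1))
        pred_state = out[..., :self.state_dim]
        p_success = torch.sigmoid(out[..., self.state_dim:])
        return pred_state, p_success


def train_step(model, optimizer, state, goal, final_state, success):
    """One supervised update on (state, goal) -> (final_state, success) tuples
    collected from skill rollouts."""
    pred_state, p_success = model(state, goal)
    loss = nn.functional.mse_loss(pred_state, final_state) \
        + nn.functional.binary_cross_entropy(p_success.squeeze(-1), success)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A model of this shape is what the look-ahead sketch above queries via `predict(state, skill_goal)`; wrapping `forward` in `torch.no_grad()` at search time would keep tree expansion cheap.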