Planning Goals for Exploration
Authors: Edward S. Hu, Richard Chang, Oleh Rybkin, Dinesh Jayaraman
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate PEG and other goal-conditioned RL agents on four different continuous-control environments, described below. For each environment, we define an evaluation set of goals. As a general principle, we pick evaluation goals in each environment that require extensive exploration in order for the agent to learn a successful evaluation goal-reaching policy. ... Figure 4: Performance of agents with different goal-setting strategies. All methods are run with 10 seeds. PEG outperforms all baselines, and its performance gain increases with environment difficulty. |
| Researcher Affiliation | Academia | GRASP Lab, Department of CIS, University of Pennsylvania {hued, huangkun, oleh, dineshj}@seas.upenn.edu |
| Pseudocode | Yes | Algorithm 1 LEXA Training Loop |
| Open Source Code | No | To ensure reproducibility, we will release the codebase that contains our method, baselines, and environments. |
| Open Datasets | Yes | Walker: In this environment from Tassa et al. (2018)... Ant Maze: We increased exploration difficulty in the Ant Maze environment from MEGA (Pitis et al., 2020)... 3-Block Stacking: This environment is a modification from the Fetch Stack3 environment in Pitis et al. (2020). Point Maze: This environment is taken directly from Pitis et al. (2020) with no modifications. |
| Dataset Splits | No | The paper describes a reinforcement learning setup where data is collected through interaction, and defines 'evaluation goals' for testing, but does not provide specific train/validation/test dataset splits in terms of percentages or counts from a static dataset, which are typically found in supervised learning tasks. |
| Hardware Specification | Yes | Each seed was run on 1 GPU (Nvidia 2080ti or Nvidia 3090) and 4 CPUs, and required 11GB of GPU memory. |
| Software Dependencies | No | The paper mentions using 'Dreamer V2 (Hafner et al., 2021)' and 'LEXA (Mendonca et al., 2021)' but does not provide specific version numbers for these or other software libraries. |
| Experiment Setup | Yes | We used the default hyperparameters for training the world model, policies, value functions, and temporal reward functions. For PEG, we tried various values of K for simulating trajectories of πG for each goal and found K = 1 to be sufficient. We use the same Go-Explore mechanism across all goal-setting methods: the Go and Explore phase time limits are set to half of the maximum episode length for all environments, while non-Go-Explore baselines use the full episode length for exploration. ... For each experiment, we tried weight values of (1, 2, 10) by running 1-2 seeds of PEG for each value. We used a weight of 1, 2, 2, 10 for the 4 experiments respectively. PEG uses MPPI, a sample-based optimizer, to optimize the objective. ... We therefore just choose as many samples (2000 candidates) and rounds (5 optimization rounds) as we can while keeping training time reasonable. (Hedged sketches of the Go-Explore episode structure and the MPPI goal search appear after the table.) |
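
The Experiment Setup row describes a Go-Explore episode structure shared by all goal-setting methods: a goal-conditioned policy pursues a chosen goal for the first half of the episode, then a task-agnostic exploration policy takes over. The sketch below illustrates only that structure; `env`, `goal_policy`, `explore_policy`, and `choose_goal` are hypothetical placeholders, not the authors' released code.

```python
# Hedged sketch of a Go-Explore style data-collection episode, assuming
# hypothetical `env`, `goal_policy`, `explore_policy`, and `choose_goal` objects.
# The paper sets both phase time limits to half of the maximum episode length.

def collect_episode(env, goal_policy, explore_policy, choose_goal, max_steps):
    """Run one Go (goal-reaching) phase followed by one Explore phase."""
    trajectory = []
    obs = env.reset()
    goal = choose_goal()          # e.g. PEG or a baseline goal-setting strategy
    half = max_steps // 2

    # Go phase: the goal-conditioned policy tries to reach the chosen goal.
    for _ in range(half):
        action = goal_policy.act(obs, goal)
        next_obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, next_obs, goal))
        obs = next_obs
        if done:
            return trajectory

    # Explore phase: a task-agnostic exploration policy continues from the
    # frontier state reached at the end of the Go phase.
    for _ in range(max_steps - half):
        action = explore_policy.act(obs)
        next_obs, reward, done, info = env.step(action)
        trajectory.append((obs, action, next_obs, goal))
        obs = next_obs
        if done:
            break
    return trajectory
```

Per the quoted setup, non-Go-Explore baselines would instead run a single policy for the full episode length.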
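
The same row states that PEG optimizes its goal-setting objective with MPPI, a sample-based optimizer, using 2000 candidate goals and 5 optimization rounds. Below is a minimal sketch of such a sampling-based search over goals, assuming a hypothetical scalar `exploration_value` function in place of PEG's world-model-based objective.

```python
import numpy as np

def mppi_optimize_goal(exploration_value, goal_dim,
                       num_candidates=2000, num_rounds=5, temperature=1.0):
    """Sample-based (MPPI-style) search for a goal that maximizes a scalar
    exploration objective. `exploration_value` maps a batch of candidate
    goals of shape (N, goal_dim) to scores of shape (N,); it is a
    hypothetical placeholder, not PEG's exact objective."""
    mean = np.zeros(goal_dim)
    std = np.ones(goal_dim)

    for _ in range(num_rounds):
        # Propose candidate goals around the current search distribution.
        candidates = mean + std * np.random.randn(num_candidates, goal_dim)
        scores = exploration_value(candidates)

        # Exponentially weight candidates by score and refit the distribution.
        weights = np.exp((scores - scores.max()) / temperature)
        weights /= weights.sum()
        mean = (weights[:, None] * candidates).sum(axis=0)
        std = np.sqrt((weights[:, None] * (candidates - mean) ** 2).sum(axis=0) + 1e-6)

    return mean  # the optimized goal to command during the Go phase
```

Unlike elite-set methods such as CEM, MPPI reweights every candidate by an exponential of its score, so the `temperature` parameter controls how sharply the update concentrates on the best samples.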