Planning Goals for Exploration

Authors: Edward S. Hu, Richard Chang, Oleh Rybkin, Dinesh Jayaraman

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We evaluate PEG and other goal-conditioned RL agents on four different continuous-control environments, described below. For each environment, we define an evaluation set of goals. As a general principle, we pick evaluation goals in each environment that require extensive exploration in order for the agent to learn a successful evaluation goal reaching policy. ... Figure 4: Performance of agents with different goal setting strategies. All methods are run with 10 seeds. PEG outperforms all baselines, and its performance gain increases with environment difficulty."
Researcher Affiliation | Academia | "GRASP Lab, Department of CIS, University of Pennsylvania {hued, huangkun, oleh, dineshj}@seas.upenn.edu"
Pseudocode | Yes | "Algorithm 1 LEXA Training Loop" (a sketch of this loop follows the table)
Open Source Code | No | "To ensure reproducibility, we will release the codebase that contains our method, baselines, and environments."
Open Datasets | Yes | "Walker: In this environment from Tassa et al. (2018)... Ant Maze: We increased exploration difficulty in the Ant Maze environment from MEGA (Pitis et al., 2020)... 3-Block Stacking: This environment is a modification from the Fetch Stack3 environment in Pitis et al. (2020). Point Maze: This environment is taken directly from Pitis et al. (2020) with no modifications."
Dataset Splits | No | The paper describes a reinforcement learning setup where data is collected through interaction and defines evaluation goals for testing; it does not provide train/validation/test splits as percentages or counts from a static dataset, as would be typical in supervised learning.
Hardware Specification | Yes | "Each seed was run on 1 GPU (Nvidia 2080ti or Nvidia 3090) and 4 CPUs, and required 11GB of GPU memory."
Software Dependencies | No | The paper mentions using 'Dreamer V2 (Hafner et al., 2021)' and 'LEXA (Mendonca et al., 2021)' but does not provide specific version numbers for these or other software libraries.
Experiment Setup | Yes | "We used the default hyperparameters for training the world model, policies, value functions, and temporal reward functions. For PEG, we tried various values of K for simulating trajectories of πG for each goal and found K = 1 to be sufficient. We use the same Go-Explore mechanism across all goal-setting methods: the Go and Explore phases' time limits are set to half of the maximum episode length for all environments, while non-Go-explore baselines use the full episode length for exploration. ... For each experiment, we tried weight values of (1, 2, 10) by running 1-2 seeds of PEG for each value. We used a weight of 1, 2, 2, 10 for the 4 experiments respectively. PEG uses MPPI, a sample-based optimizer, to optimize the objective. ... We therefore just choose as many samples (2000 candidates) and rounds (5 optimization rounds) as we can while keeping training time reasonable." (an MPPI sketch follows the table)
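
The Pseudocode and Experiment Setup rows together describe the training structure: each episode is split in half, with the goal-conditioned policy pursuing a commanded goal during the Go phase and an exploration policy taking over for the Explore phase. The sketch below illustrates only that structure; it is a minimal stand-in, and the environment interface, policies, replay buffer, and choose_goal hook are all hypothetical names, not the authors' released LEXA/PEG code.

```python
# Minimal sketch of a Go-Explore style training loop in the spirit of the
# "LEXA Training Loop" referenced above, with the half/half Go/Explore time
# split quoted in the Experiment Setup row. Everything here is a hypothetical
# stand-in for illustration, not the paper's implementation.
import random


class RandomPolicy:
    """Placeholder for both the goal-conditioned policy pi_G and the explorer."""

    def act(self, obs, goal=None):
        return random.uniform(-1.0, 1.0)  # 1-D action, purely illustrative


def collect_episode(env, goal_policy, explore_policy, goal, max_len):
    """Go phase for the first half of the episode, Explore phase for the rest."""
    obs = env.reset()
    trajectory = []
    go_steps = max_len // 2  # each phase gets half the episode, as in the paper
    for t in range(max_len):
        if t < go_steps:
            action = goal_policy.act(obs, goal)   # Go: pursue the commanded goal
        else:
            action = explore_policy.act(obs)      # Explore: collect novel data
        obs, reward, done, info = env.step(action)
        trajectory.append((obs, action))
        if done:
            break
    return trajectory


def training_loop(env, goal_policy, explore_policy, choose_goal, max_len, iters):
    """Outer loop: pick a goal, roll out Go-Explore, store data for later updates."""
    replay_buffer = []
    for _ in range(iters):
        goal = choose_goal()  # e.g. a planned exploratory goal, as in PEG
        replay_buffer.extend(
            collect_episode(env, goal_policy, explore_policy, goal, max_len))
        # World-model, policy, and value updates on replay_buffer would go here.
    return replay_buffer
```

In the paper's setup, choose_goal would be PEG's goal planner and the update step would follow Dreamer-style world-model training; those pieces are left as comments here.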
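
The Experiment Setup row also states that PEG optimizes its goal-choosing objective with MPPI, using 2000 candidate samples and 5 optimization rounds. The sketch below shows a generic MPPI-style search with that budget; exploration_value is a toy placeholder for PEG's actual objective, and none of the function names come from the paper's code.

```python
# Hedged sketch of MPPI-style goal search with the quoted sample budget
# (2000 candidates, 5 optimization rounds). The objective is a toy stand-in.
import numpy as np


def exploration_value(goals):
    """Placeholder objective: score each candidate goal (higher is better)."""
    return -np.linalg.norm(goals - 5.0, axis=-1)  # toy stand-in, not PEG's objective


def mppi_goal_search(goal_dim, num_candidates=2000, num_rounds=5, temperature=1.0):
    mean = np.zeros(goal_dim)
    std = np.ones(goal_dim)
    for _ in range(num_rounds):
        # Sample candidate goals from the current search distribution.
        candidates = mean + std * np.random.randn(num_candidates, goal_dim)
        scores = exploration_value(candidates)
        # Softmax-weight candidates by score and refit the distribution (MPPI update).
        weights = np.exp((scores - scores.max()) / temperature)
        weights /= weights.sum()
        mean = (weights[:, None] * candidates).sum(axis=0)
        std = np.sqrt((weights[:, None] * (candidates - mean) ** 2).sum(axis=0)) + 1e-6
    return mean  # the goal that would be commanded to the Go policy


if __name__ == "__main__":
    print(mppi_goal_search(goal_dim=2))
```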