Hindsight Foresight Relabeling for Meta-Reinforcement Learning

Authors: Michael Wan, Jian Peng, Tanmay Gangwani

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We find that HFR improves performance when compared to other relabeling methods on a variety of meta-RL tasks. We evaluate on a set of both sparse and dense reward MuJoCo environments (Todorov et al., 2012) modeled in OpenAI Gym (Brockman et al., 2016). Figure 6 plots the performance (average returns or success rate) on the held-out meta-test tasks on the y-axis, with the total timesteps of environment interaction for meta-training on the x-axis.
Researcher Affiliation | Academia | Michael Wan, Jian Peng & Tanmay Gangwani, University of Illinois at Urbana-Champaign
Pseudocode | Yes | Algorithm 1: Hindsight Foresight Relabeling (HFR). Algorithm 2: Computation of the utility function based on the Bellman error (Eq. 10), for PEARL-based meta-RL. (A hedged sketch of the relabeling step follows the table.)
Open Source Code | Yes | Code: https://www.github.com/michaelwan11/hfr
Open Datasets | Yes | We evaluate on a set of both sparse and dense reward MuJoCo environments (Todorov et al., 2012) modeled in OpenAI Gym (Brockman et al., 2016). Ant-Goal: We use the Ant-Goal task from Gupta et al. (2018). Ant-Vel: We use the Ant environment from OpenAI Gym. Cheetah-Highdim: We take the Cheetah-Highdim task from Lin et al. (2020). (A minimal environment-loading example follows the table.)
Dataset Splits | No | The paper provides details on 'Train Tasks' and 'Test Tasks' in Table 2, but there is no explicit mention of a separate validation split for hyperparameter tuning or early stopping. Standard practice in RL often folds validation into training or relies on held-out test tasks evaluated after training.
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU models, CPU types, memory) used to run the experiments; it only refers to 'MuJoCo environments' and 'OpenAI Gym'.
Software Dependencies | No | Table 1 lists PEARL hyperparameters such as 'Nonlinearity: ReLU' and 'Optimizer: Adam', but does not provide version numbers for software dependencies such as Python, PyTorch, TensorFlow, or other libraries. It only names PEARL as the base algorithm.
Experiment Setup | Yes | Table 1: PEARL hyperparameters used for all experiments. Hyperparameter values: Nonlinearity ReLU, Optimizer Adam, Policy Learning Rate 3e-4, Q-function Learning Rate 3e-4, Batch Size 256, Replay Buffer Size 1e6. Table 2: Environment Details includes Discount, Horizon, Train Tasks, Test Tasks, and Number of Exploration Steps. (The Table 1 values are transcribed into a config sketch after the table.)
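
To make the Pseudocode row concrete, below is a minimal, hypothetical Python sketch of the relabeling step in the spirit of Algorithm 1: a collected trajectory is scored against every candidate task with a utility function (in the paper, Eq. 10 derives this utility from the Bellman error of the PEARL critic), and its rewards are rewritten for the highest-utility task. All names here (Task, relabel_trajectory, utility_fn) are illustrative assumptions and are not taken from the released code.

```python
# Hypothetical sketch of hindsight relabeling driven by a utility function,
# in the spirit of HFR's Algorithm 1; not the authors' implementation.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# (state, action, reward, next_state); plain lists stand in for numpy arrays.
Transition = Tuple[list, list, float, list]


@dataclass
class Task:
    reward_fn: Callable[[list, list, list], float]          # task-specific reward r(s, a, s')
    buffer: List[Transition] = field(default_factory=list)  # stands in for a replay buffer


def relabel_trajectory(traj: List[Transition],
                       tasks: List[Task],
                       utility_fn: Callable[[List[Transition], Task], float]) -> Task:
    """Assign `traj` to the candidate task with the highest utility and
    rewrite its rewards with that task's reward function."""
    # In the paper the utility is computed from the Bellman error of the
    # PEARL critic (Eq. 10); here it is an abstract callable.
    best_task = max(tasks, key=lambda task: utility_fn(traj, task))
    relabeled = [(s, a, best_task.reward_fn(s, a, s2), s2) for (s, a, _, s2) in traj]
    best_task.buffer.extend(relabeled)
    return best_task
```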
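
The Open Datasets row lists the base MuJoCo environments. As a usage illustration only, the snippet below loads the standard Gym versions; the "-v2" environment IDs assume the mujoco-py-era Gym releases in common use at the time, and the meta-RL task variants (Ant-Goal, Ant-Vel, Cheetah-Highdim) are task wrappers from the cited works rather than registered Gym IDs.

```python
# Illustrative only: load the base MuJoCo environments via OpenAI Gym.
# The paper's meta-RL tasks wrap these with task-specific rewards and are
# not standard registered environments.
import gym

ant = gym.make("Ant-v2")              # base env underlying the Ant-Goal / Ant-Vel tasks
cheetah = gym.make("HalfCheetah-v2")  # assumed base env for the Cheetah-Highdim task

obs = ant.reset()
obs, reward, done, info = ant.step(ant.action_space.sample())
```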
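
Finally, the Table 1 values from the Experiment Setup row, transcribed into a plain config dictionary; the key names are illustrative and do not necessarily match those in the released code.

```python
# Table 1 PEARL hyperparameters as a config dict; key names are assumptions.
pearl_hyperparameters = {
    "nonlinearity": "relu",
    "optimizer": "adam",
    "policy_learning_rate": 3e-4,
    "q_function_learning_rate": 3e-4,
    "batch_size": 256,
    "replay_buffer_size": int(1e6),
}
```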