Generalized Hindsight for Reinforcement Learning

Authors: Alexander Li, Lerrel Pinto, Pieter Abbeel

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 'We test our algorithm on several multi-task control tasks, and find that AIR consistently achieves higher asymptotic performance using as few as 20% of the environment interactions of our baselines. We also introduce a computationally more efficient version, which relabels by comparing trajectory rewards to a learned baseline, that also achieves higher asymptotic performance than our baselines.' (Section 4, Experimental Evaluation)
Researcher Affiliation | Academia | Alexander C. Li, University of California, Berkeley (alexli1@berkeley.edu); Lerrel Pinto, New York University (lerrel@cs.nyu.edu); Pieter Abbeel, University of California, Berkeley (pabbeel@cs.berkeley.edu)
Pseudocode | Yes | Algorithm 1: Generalized Hindsight; Algorithm 2: S_IRL (Approximate IRL); Algorithm 3: S_A (Trajectory Advantage) (see the sketch after the table)
Open Source Code | No | Website: sites.google.com/view/generalized-hindsight (the website states 'Code: TBA', indicating the code is not yet available)
Open Datasets | Yes | 'These environments will be released for open-source access.'
Dataset Splits | No | The paper does not provide specific training/validation/test dataset splits, exact percentages, sample counts, or citations to predefined splits.
Hardware Specification | No | The paper states, 'We thank AWS for computing resources,' but does not specify any particular GPU models, CPU models, or other detailed hardware used for running the experiments.
Software Dependencies | No | The paper mentions using Soft Actor-Critic (SAC), the Adam optimizer, and OpenAI Gym, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | 'In our experiments, we simply select m = 1 task out of K = 100 sampled task variables for all environments and both relabeling strategies. [...] We found that a batch size of 256 for all Half Cheetah experiments and 128 for others was optimal. We use the Adam optimizer with a learning rate of 3e-4 for all networks.' (see the configuration sketch after the table)
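
The Pseudocode row above points to Algorithm 1 (Generalized Hindsight) and the two relabeling strategies. Below is a minimal Python sketch of the core idea under the advantage-style strategy: after a trajectory is collected, score K candidate task variables by the trajectory's return minus a learned baseline, keep the top m, and store the transitions relabeled with those tasks' rewards. The helper names (sample_tasks, reward_fn, baseline, replay_buffer) are hypothetical placeholders, since the authors' code has not been released.

# Sketch only, not the authors' implementation: Generalized Hindsight
# relabeling with an advantage-style strategy. All helpers are assumed.
import numpy as np

def advantage_relabel(trajectory, sample_tasks, reward_fn, baseline, K=100, m=1):
    # trajectory: list of (state, action, next_state) tuples
    # sample_tasks(K): draws K candidate task variables z from the task distribution
    # reward_fn(s, a, z): task-conditioned reward
    # baseline(s0, z): learned estimate of the expected return from s0 under task z
    candidates = sample_tasks(K)
    s0 = trajectory[0][0]
    scores = []
    for z in candidates:
        ret = sum(reward_fn(s, a, z) for s, a, _ in trajectory)
        scores.append(ret - baseline(s0, z))  # how much better than expected the trajectory is under z
    best = np.argsort(scores)[-m:]            # keep the m tasks the trajectory "solves" best
    return [candidates[i] for i in best]

def store_relabeled(trajectory, new_tasks, reward_fn, replay_buffer):
    # Recompute rewards under each relabeled task and add the transitions to the
    # off-policy replay buffer (e.g., for a SAC learner).
    for z in new_tasks:
        for s, a, s_next in trajectory:
            replay_buffer.add(s, a, reward_fn(s, a, z), s_next, z)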
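
For the Experiment Setup row, the quoted hyperparameters can be gathered into one illustrative configuration. The field names are invented for this sketch; the paper does not publish a config file.

# Hyperparameters quoted in the Experiment Setup row; names are assumptions.
HINDSIGHT_CONFIG = {
    "num_candidate_tasks": 100,  # K: task variables sampled per relabeling step
    "num_relabeled_tasks": 1,    # m: tasks kept per trajectory
}

TRAINING_CONFIG = {
    "optimizer": "Adam",
    "learning_rate": 3e-4,       # used for all networks
    "batch_size": {"HalfCheetah": 256, "default": 128},  # 256 for Half Cheetah, 128 otherwise
}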