Generalized Hindsight for Reinforcement Learning
Authors: Alexander Li, Lerrel Pinto, Pieter Abbeel
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We test our algorithm on several multi-task control tasks, and find that AIR consistently achieves higher asymptotic performance using as few as 20% of the environment interactions as our baselines. We also introduce a computationally more efficient version, which relabels by comparing trajectory rewards to a learned baseline, that also achieves higher asymptotic performance than our baselines. (Section 4 Experimental Evaluation) |
| Researcher Affiliation | Academia | Alexander C. Li University of California, Berkeley alexli1@berkeley.edu Lerrel Pinto New York University lerrel@cs.nyu.edu Pieter Abbeel University of California, Berkeley pabbeel@cs.berkeley.edu |
| Pseudocode | Yes | Algorithm 1 Generalized Hindsight; Algorithm 2 S_IRL: Approximate IRL; Algorithm 3 S_A: Trajectory Advantage (a hedged sketch of the relabeling step appears below the table) |
| Open Source Code | No | Website: sites.google.com/view/generalized-hindsight (Upon checking the website, it states 'Code: TBA', indicating code is not yet available.) |
| Open Datasets | Yes | These environments will be released for open-source access. |
| Dataset Splits | No | The paper does not provide specific training/validation/test dataset splits, exact percentages, sample counts, or citations to predefined splits. |
| Hardware Specification | No | The paper states, 'We thank AWS for computing resources.' However, it does not specify any particular GPU models, CPU models, or other detailed hardware specifications used for running experiments. |
| Software Dependencies | No | The paper mentions using Soft Actor-Critic (SAC), Adam optimizer, and OpenAI Gym, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | In our experiments, we simply select m = 1 task out of K = 100 sampled task variables for all environments and both relabeling strategies. [...] We found that a batch size of 256 for all Half Cheetah experiments and 128 for others was optimal. We use the Adam optimizer with a learning rate of 3e-4 for all networks. |
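A hedged sketch of the relabeling step described in the Research Type and Pseudocode rows: a finished trajectory is scored under K = 100 sampled candidate task variables and relabeled with the m = 1 task whose return is highest relative to a learned baseline. The function and variable names (`sample_tasks`, `reward_fn`, `baseline`), the task distribution, and the reward form are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def sample_tasks(num_tasks: int, task_dim: int, rng: np.random.Generator) -> np.ndarray:
    """Sample candidate task variables z ~ p(z); a uniform prior is assumed here."""
    return rng.uniform(-1.0, 1.0, size=(num_tasks, task_dim))


def reward_fn(state: np.ndarray, action: np.ndarray, task: np.ndarray) -> float:
    """Placeholder task-conditioned reward r(s, a, z): negative distance to the task vector."""
    return -float(np.linalg.norm(state[: task.shape[0]] - task))


def relabel_trajectory(states, actions, baseline, num_candidates=100, m=1, task_dim=2, seed=0):
    """Return the m candidate tasks under which the trajectory scores best
    relative to a per-task baseline (advantage-style relabeling sketch)."""
    rng = np.random.default_rng(seed)
    candidates = sample_tasks(num_candidates, task_dim, rng)
    # Cumulative trajectory reward under each candidate task variable.
    returns = np.array([
        sum(reward_fn(s, a, z) for s, a in zip(states, actions))
        for z in candidates
    ])
    # Compare against a learned baseline V(z); `baseline` is a callable stub here.
    advantages = returns - np.array([baseline(z) for z in candidates])
    best = np.argsort(advantages)[-m:]
    return candidates[best]


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    horizon, state_dim, action_dim = 50, 4, 2
    states = rng.normal(size=(horizon, state_dim))
    actions = rng.normal(size=(horizon, action_dim))
    relabeled = relabel_trajectory(states, actions, baseline=lambda z: 0.0)
    print("relabeled task variable(s):", relabeled)
```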
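The hyperparameters quoted in the Experiment Setup row can likewise be pinned down in a minimal configuration sketch, assuming a PyTorch-style SAC implementation (the paper names no library versions); the network shape below is an illustrative placeholder.

```python
from torch import nn, optim

# Batch sizes reported as optimal: 256 for Half Cheetah, 128 for the other environments.
BATCH_SIZE = {"HalfCheetah": 256, "default": 128}


def make_optimizer(network: nn.Module) -> optim.Adam:
    """Adam with learning rate 3e-4, as stated for all networks."""
    return optim.Adam(network.parameters(), lr=3e-4)


if __name__ == "__main__":
    # Illustrative policy network only; layer sizes are not taken from the paper.
    policy = nn.Sequential(nn.Linear(17, 256), nn.ReLU(), nn.Linear(256, 6))
    optimizer = make_optimizer(policy)
    print(optimizer, BATCH_SIZE)
```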