Goal-conditioned Imitation Learning
Authors: Yiming Ding, Carlos Florensa, Pieter Abbeel, Mariano Phielipp
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate these questions in four different simulated robotic goal-conditioned tasks that are detailed in the next subsection along with the performance metric used throughout the experiments section. All the results use 20 demonstrations reaching uniformly sampled goals. All curves have 5 random seeds and the shaded area is one standard deviation. |
| Researcher Affiliation | Collaboration | Yiming Ding, Department of Computer Science, University of California, Berkeley (dingyiming0427@berkeley.edu); Carlos Florensa, Department of Computer Science, University of California, Berkeley (florensa@berkeley.edu); Mariano Phielipp, Intel AI Labs (mariano.j.phielipp@intel.com); Pieter Abbeel, Department of Computer Science, University of California, Berkeley (pabbeel@berkeley.edu) |
| Pseudocode | Yes | Algorithm 1 Goal-conditioned GAIL with Hindsight: goalGAIL (a minimal sketch of the hindsight relabeling step follows the table) |
| Open Source Code | Yes | Our code is open-source: https://sites.google.com/view/goalconditioned-il/ |
| Open Datasets | No | Experiments are conducted in four continuous environments in MuJoCo [41]. The paper describes custom simulated environments rather than referring to or providing access to a pre-existing public dataset. |
| Dataset Splits | No | The paper does not specify explicit training/validation/test dataset splits, percentages, or absolute sample counts for data used in experiments. |
| Hardware Specification | No | The paper does not mention any specific hardware specifications (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions the MuJoCo simulator [41] and DDPG for policy optimization but does not specify software versions or other dependencies. |
| Experiment Setup | Yes | In the four environments used in our experiments, i.e. the Four Rooms environment, Fetch Pick & Place, Pointmass block pusher and Fetch Stack Two, the task horizons are set to 300, 100, 100 and 150 respectively. The discount factors are γ = 1 − 1/H. In all experiments, the Q function, policy and discriminator are parameterized by fully connected neural networks with two hidden layers of size 256. DDPG is used for policy optimization and the hindsight probability is set to p = 0.8. The initial value of the behavior cloning loss weight β is set to 0.1 and is annealed by 0.9 per 250 rollouts collected. The initial value of the discriminator reward weight δGAIL is set to 0.1. |
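
To make the reported setup concrete, the sketch below collects the hyperparameters quoted in the Experiment Setup row. This is not the authors' released code: the function names and dictionary keys (e.g. `make_goal_gail_config`) are hypothetical, and only the numeric values are taken from the paper.

```python
# A minimal sketch (not the authors' code) of the reported goalGAIL hyperparameters.
# All names are hypothetical; the numbers come from the Experiment Setup row above.

def make_goal_gail_config(env_name: str) -> dict:
    """Return the reported hyperparameters for one of the four environments."""
    horizons = {
        "FourRooms": 300,
        "FetchPickAndPlace": 100,
        "PointmassBlockPusher": 100,
        "FetchStackTwo": 150,
    }
    H = horizons[env_name]
    return {
        "horizon": H,
        "discount": 1.0 - 1.0 / H,       # gamma = 1 - 1/H
        "hidden_sizes": (256, 256),      # Q function, policy and discriminator MLPs
        "policy_optimizer": "DDPG",
        "hindsight_prob": 0.8,           # p
        "bc_weight_init": 0.1,           # beta, annealed below
        "bc_weight_decay": 0.9,          # multiplied in once per 250 rollouts
        "bc_anneal_every_rollouts": 250,
        "gail_reward_weight": 0.1,       # delta_GAIL (initial value)
        "num_demonstrations": 20,
    }


def bc_weight(config: dict, rollouts_collected: int) -> float:
    """Behavior-cloning loss weight after a given number of collected rollouts."""
    steps = rollouts_collected // config["bc_anneal_every_rollouts"]
    return config["bc_weight_init"] * config["bc_weight_decay"] ** steps


if __name__ == "__main__":
    cfg = make_goal_gail_config("FetchPickAndPlace")
    print(cfg["discount"])          # 0.99 for H = 100
    print(bc_weight(cfg, 1000))     # 0.1 * 0.9**4
```

Note that setting γ = 1 − 1/H ties the discount factor to the task horizon, which is why the four environments end up with different discounts despite sharing the same rule.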
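
The Pseudocode row references Algorithm 1 (goalGAIL), which combines goal-conditioned GAIL with hindsight relabeling. The following is a minimal sketch of that relabeling step only, under assumed data structures: the `Transition` class and the "future" goal-sampling strategy are assumptions, and only the relabeling probability p = 0.8 comes from the Experiment Setup row.

```python
# A minimal sketch of hindsight goal relabeling, not the authors' Algorithm 1.
# Transition fields and the "future" sampling strategy are assumptions.
import random
from dataclasses import dataclass, replace as dc_replace


@dataclass
class Transition:
    state: list
    action: list
    achieved_goal: list   # goal actually reached after this transition
    desired_goal: list    # goal the policy was conditioned on


def relabel_with_hindsight(trajectory, p=0.8, rng=random):
    """With probability p, swap each transition's desired goal for a goal
    achieved later in the same trajectory (HER-style 'future' strategy)."""
    relabeled = []
    for t, tr in enumerate(trajectory):
        if rng.random() < p:
            future = rng.randrange(t, len(trajectory))
            tr = dc_replace(tr, desired_goal=trajectory[future].achieved_goal)
        relabeled.append(tr)
    return relabeled
```

In goalGAIL the relabeled transitions are fed both to the off-policy learner and to the discriminator, but the exact bookkeeping should be taken from the paper's Algorithm 1 and the open-source code linked above rather than from this sketch.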