Directed-Info GAIL: Learning Hierarchical Policies from Unsegmented Demonstrations using Directed Information
Authors: Mohit Sharma, Arjun Sharma, Nicholas Rhinehart, Kris M. Kitani
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present results on both discrete and continuous state-action environments. In both of these settings we show that our method is able to (1) segment out sub-tasks from given expert trajectories, (2) learn sub-task conditioned policies, and (3) learn to combine these sub-task policies in order to achieve the task objective. |
| Researcher Affiliation | Academia | Mohit Sharma, Arjun Sharma, Nick Rhinehart, Kris M. Kitani Robotics Institute Carnegie Mellon University Pittsburgh, PA 15213, USA {mohits1,arjuns2,nrhineha,kkitani}@cs.cmu.edu |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | A video of our results on Hopper and Walker environments can be seen at https://sites.google.com/view/directedinfo-gail. |
| Open Datasets | Yes | we also show experiments on Pendulum, Inverted Pendulum, Hopper and Walker environments, provided in OpenAI Gym (Brockman et al., 2016). |
| Dataset Splits | No | We used 25 expert trajectories for the Pendulum and Inverted Pendulum tasks and 50 expert trajectories for experiments with the Hopper and Walker environments. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models. |
| Software Dependencies | No | We used Adam (Kingma & Ba, 2014) as our optimizer setting an initial learning rate of 3e-4. Further, we used the Proximal Policy Optimization algorithm (Schulman et al., 2017) to train our policy network with ϵ = 0.2. |
| Experiment Setup | Yes | Table 3 lists the experiment settings for all of the different environments. We use multi-layer perceptrons for our policy (generator), value, reward (discriminator) and posterior function representations. Each network consisted of 2 hidden layers with 64 units in each layer and ReLU as our non-linearity function. We used Adam (Kingma & Ba, 2014) as our optimizer setting an initial learning rate of 3e-4. Further, we used the Proximal Policy Optimization algorithm (Schulman et al., 2017) to train our policy network with ϵ = 0.2. For the VAE pre-training step we set the VAE learning rate also to 3e-4. For the Gumbel-Softmax distribution we set an initial temperature τ = 5.0. The temperature is annealed using an exponential decay with the following schedule τ = max(0.1, exp(−kt)), where k = 3e-3 and t is the current epoch. |
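
The quoted experiment settings translate fairly directly into code. The sketch below is a minimal, hypothetical reconstruction of those hyperparameters, assuming PyTorch (the paper does not name a framework); the network sizes, optimizer, PPO clip value, and Gumbel-Softmax temperature schedule follow the quoted settings, while the environment dimensions, variable names, and the use of `F.gumbel_softmax` are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' released code) of the reported training setup,
# assuming PyTorch; observation/action/latent sizes below are hypothetical.
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim


def make_mlp(in_dim, out_dim, hidden=64):
    """2 hidden layers of 64 units with ReLU, as reported for the policy
    (generator), value, reward (discriminator) and posterior networks."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


# Hypothetical dimensions for illustration only.
obs_dim, act_dim, latent_dim = 11, 3, 4

policy_net = make_mlp(obs_dim + latent_dim, act_dim)    # generator
value_net = make_mlp(obs_dim + latent_dim, 1)
discriminator = make_mlp(obs_dim + act_dim, 1)          # reward
posterior_net = make_mlp(obs_dim + act_dim, latent_dim)

# Adam with an initial learning rate of 3e-4 (the same rate is reported for VAE pre-training).
optimizer = optim.Adam(policy_net.parameters(), lr=3e-4)

# PPO clipping parameter epsilon = 0.2; the clipped surrogate would be
#   loss = -min(ratio * adv, clamp(ratio, 1 - 0.2, 1 + 0.2) * adv)
PPO_CLIP_EPS = 0.2

TAU_INIT = 5.0  # reported initial Gumbel-Softmax temperature


def gumbel_softmax_temperature(epoch, k=3e-3, tau_min=0.1):
    """Exponential decay schedule tau = max(0.1, exp(-k * t)), t = current epoch."""
    return max(tau_min, math.exp(-k * epoch))


# Example: sample a (soft) discrete latent code with the annealed temperature.
logits = posterior_net(torch.randn(1, obs_dim + act_dim))
latent_sample = F.gumbel_softmax(logits, tau=gumbel_softmax_temperature(epoch=10))
```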