State Alignment-based Imitation Learning
Authors: Fangchen Liu, Zhan Ling, Tongzhou Mu, Hao Su
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To empirically justify our ideas, we conduct experiments in two different settings. We first show that our approach can achieve similar or better results in the standard imitation learning setting, which assumes the same dynamics between the expert and the imitator. We then evaluate our approach in the more challenging setting in which the dynamics of the expert and the imitator are different. In a number of control tasks, we either change the physical properties of the imitators or cripple them by changing their geometries. Existing approaches either fail or can only achieve very low rewards, but our approach can still exhibit decent performance. |
| Researcher Affiliation | Academia | Fangchen Liu, Zhan Ling, Tongzhou Mu, Hao Su; University of California San Diego, La Jolla, CA 92093, USA; {fliu,z6ling,t3mu,haosu}@eng.ucsd.edu |
| Pseudocode | Yes | Algorithm 1 SAIL: State Alignment based Imitation Learning |
| Open Source Code | No | The paper does not contain any explicit statement about releasing source code for their proposed method or a link to a code repository. |
| Open Datasets | Yes | We create environments using MuJoCo (Todorov et al., 2012) by changing some properties of the experts, such as the density and geometry of the body. We choose 2 environments, Ant and Swimmer, and augment them into 6 different environments: Heavy/Light/Disabled Ant/Swimmer. The demonstrations are collected from the standard Ant-v2 and Swimmer-v2. More descriptions of the environments and the demonstration collection process can be found in the Appendix. We use six MuJoCo (Todorov et al., 2012) control tasks. The names and versions of the environments are listed in Table 6, which also lists the state and action dimensions of the tasks along with expert performance and the reward threshold indicating the minimum score to solve the task. All the experts are trained using SAC (Haarnoja et al., 2018) except on Swimmer-v2, where TRPO (Schulman et al., 2015) achieves higher performance. (A hedged demonstration-collection sketch follows the table.) |
| Dataset Splits | No | The paper uses expert demonstration trajectories for training and evaluating performance, but does not specify explicit training/validation/test dataset splits (e.g., percentages or sample counts for each split) in the conventional sense for reproducibility. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like MuJoCo, Soft Actor Critic (SAC), TRPO, and PPO, but does not provide specific version numbers for these or other key software dependencies. |
| Experiment Setup | Yes | When we pretrain the policy network with our method, we choose β = 0.05 in the β-VAE. We use Adam with learning rate 3e-4 as the optimizer for all the experiments. The policy network and value network used in the algorithms are all three-layer ReLU networks with hidden size 256. We choose σ = 0.1 in the policy prior for all the environments. (A hedged configuration sketch follows the table.) |
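
The demonstration-collection process referenced in the Open Datasets row can be sketched as follows, assuming the pre-0.26 Gym API and an already trained expert; the `expert_policy` callable is a hypothetical placeholder rather than the authors' code.

```python
import gym
import numpy as np

def collect_demonstrations(env_name, expert_policy, num_trajectories=10):
    """Roll out a trained expert in a standard MuJoCo task and record the
    visited states, which is the quantity a state-alignment method compares."""
    env = gym.make(env_name)  # e.g. "Ant-v2" or "Swimmer-v2"
    trajectories = []
    for _ in range(num_trajectories):
        states = []
        obs, done = env.reset(), False  # pre-0.26 Gym: reset() returns obs only
        while not done:
            states.append(obs)
            action = expert_policy(obs)             # placeholder expert policy
            obs, reward, done, info = env.step(action)
        trajectories.append(np.array(states))
    return trajectories
```

The six imitator variants (Heavy/Light/Disabled Ant/Swimmer) are described in the paper as modifications of the body's density and geometry; those MuJoCo model edits are not reproduced here.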
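
The hyperparameters in the Experiment Setup row map directly onto code. Below is a minimal sketch, assuming PyTorch; the state/action dimensions (taken from Ant-v2 as an example), the reading of "three-layer" as three linear layers, and the β-VAE loss wiring are assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    """Three-layer ReLU network with hidden size 256, as reported."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

state_dim, action_dim = 111, 8           # Ant-v2 sizes, used here as an example
policy_net = mlp(state_dim, action_dim)  # policy network
value_net = mlp(state_dim, 1)            # value network

# Adam with learning rate 3e-4 for all experiments
optimizer = torch.optim.Adam(
    list(policy_net.parameters()) + list(value_net.parameters()), lr=3e-4
)

SIGMA = 0.1  # standard deviation of the Gaussian policy prior
BETA = 0.05  # weight on the KL term when pretraining with a beta-VAE

def beta_vae_loss(recon, target, mu, logvar, beta=BETA):
    """Standard beta-VAE objective: reconstruction error plus beta times KL."""
    recon_loss = nn.functional.mse_loss(recon, target)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + beta * kl
```

Whether "three-layer" counts hidden layers or linear layers is not specified in the excerpt, so the network depth above is one reasonable interpretation.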