Sample Efficient Imitation Learning for Continuous Control
Authors: Fumihiro Sasaki, Tetsuya Yohira, Atsuo Kawaguchi
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our algorithm achieves competitive results with GAIL while significantly reducing the environment interactions. |
| Researcher Affiliation | Industry | Fumihiro Sasaki, Tetsuya Yohira & Atsuo Kawaguchi Ricoh Company, Ltd. {fumihiro.fs.sasaki,tetsuya.yohira,atsuo.kawaguchi}@jp.ricoh.com |
| Pseudocode | Yes | Algorithm 1 Overview of our IL algorithm |
| Open Source Code | No | The information is insufficient. The paper states, 'We use publicly available code (https://github.com/openai/imitation) for the implementation of GAIL and BC,' which refers to third-party code, not the authors' own code for their proposed algorithm. |
| Open Datasets | Yes | In our experiments, we aim to answer the following three questions: We use five physics-based control tasks that are simulated with MuJoCo physics simulator (Todorov et al., 2012). We train an agent on each task by TRPO (Schulman et al., 2015a) using the rewards defined in the OpenAI Gym (Brockman et al., 2016) |
| Dataset Splits | No | The information is insufficient. While the paper mentions 'validation rollouts' and describes 'sparse sampling setup' and 'dense sampling setup' for data generation, it does not provide explicit dataset splits (e.g., percentages or fixed counts) for training, validation, and test sets in a static dataset context. |
| Hardware Specification | Yes | All experiments are run on a PC with a 3.30 GHz Intel Core i7-5820k Processor, a GeForce GTX Titan GPU, and 32GB of RAM. |
| Software Dependencies | No | The information is insufficient. The paper mentions using RMSProp and refers to external code for baselines, but does not provide specific software dependencies or library names with version numbers for their own implementation. |
| Experiment Setup | Yes | PN has 100 hidden units in each hidden layer, and its final output is followed by hyperbolic tangent nonlinearity to bound its action range. QN has 500 hidden units in each hidden layer, and its single output is followed by sigmoid nonlinearity to bound the output between [0, 1]. All hidden layers are followed by leaky rectified nonlinearity (Maas et al., 2013). The parameters in all layers are initialized by Xavier initialization (Glorot & Bengio, 2010). The input of PN is given by concatenated vector representations for the state s and noise z. The noise vector, whose dimensionality corresponds to that of the state vector, is generated by a zero-mean normal distribution so that z ∼ P_z = N(0, 1). The input of QN is given by concatenated vector representations for the state s and action a. We employ RMSProp (Hinton et al., 2012) for learning parameters with a decay rate 0.995 and epsilon 10^-8. The learning rates are initially set to 10^-3 for QN and 10^-4 for PN, respectively. The target QN with parameters ν′ is updated so that ν′ ← 10^-3 ν + (1 − 10^-3) ν′ at each update of ν. We linearly decrease the learning rates as the training proceeds. We set the minibatch size of (s_t, a_t, s_{t+1}) triplets to 64, the replay buffer size \|B_β\| = 15000, and the discount factor γ = 0.85. We sample 128 noise vectors for calculating the empirical expectation E_{z∼P_z} of the gradient (6). |
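For reference, below is a minimal PyTorch sketch of the PN/QN setup quoted in the Experiment Setup row. It is not the authors' code: the number of hidden layers (two here) and the state/action dimensions (Hopper-like values) are assumptions, since the paper specifies only the units per hidden layer, the activations, the initialization, the optimizer settings, and the target-network update rule.

```python
# Hypothetical sketch of the described PN/QN architecture and optimizers.
# Assumptions (not stated in the paper): two hidden layers per network,
# and example state/action dimensions.
import torch
import torch.nn as nn


def mlp(sizes, out_act):
    """Build an MLP with leaky-ReLU hidden layers, a given output
    activation, and Xavier-initialized linear layers."""
    layers = []
    for i in range(len(sizes) - 2):
        layers += [nn.Linear(sizes[i], sizes[i + 1]), nn.LeakyReLU()]
    layers += [nn.Linear(sizes[-2], sizes[-1]), out_act]
    net = nn.Sequential(*layers)
    for m in net.modules():
        if isinstance(m, nn.Linear):
            nn.init.xavier_uniform_(m.weight)  # Xavier initialization
    return net


state_dim, action_dim = 11, 3  # assumed, task-dependent

# PN: input = concat(state, noise); the noise dim equals the state dim,
# 100 units per hidden layer, tanh output to bound the action range.
pn = mlp([state_dim * 2, 100, 100, action_dim], nn.Tanh())

# QN: input = concat(state, action), 500 units per hidden layer,
# sigmoid output bounded to [0, 1].
qn = mlp([state_dim + action_dim, 500, 500, 1], nn.Sigmoid())
target_qn = mlp([state_dim + action_dim, 500, 500, 1], nn.Sigmoid())
target_qn.load_state_dict(qn.state_dict())

# RMSProp with decay rate (alpha) 0.995 and eps 1e-8;
# initial learning rates: 1e-3 for QN, 1e-4 for PN.
opt_qn = torch.optim.RMSprop(qn.parameters(), lr=1e-3, alpha=0.995, eps=1e-8)
opt_pn = torch.optim.RMSprop(pn.parameters(), lr=1e-4, alpha=0.995, eps=1e-8)


def soft_update(target, source, tau=1e-3):
    """Target QN update: nu' <- tau * nu + (1 - tau) * nu'."""
    for tp, sp in zip(target.parameters(), source.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * sp.data)
```

A forward pass through PN would concatenate the state with a fresh z ∼ N(0, 1) sample of the same dimensionality, e.g. `pn(torch.cat([s, torch.randn_like(s)], dim=-1))`, matching the noise-injection scheme described in the quoted setup.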