State Aware Imitation Learning

Authors: Yannick Schroecker, Charles L. Isbell

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We then evaluate our approach on a tabular domain in section 4.1, comparing our results to a purely supervised approach to imitation learning as well as to sample based inverse reinforcement learning. In section 4.2 we show that SAIL can successfully be applied to learn a neural network policy in a continuous bipedal walker domain and achieves significant improvements over supervised imitation learning in this domain."
Researcher Affiliation | Academia | Yannick Schroecker, College of Computing, Georgia Institute of Technology (yannickschroecker@gatech.edu); Charles Isbell, College of Computing, Georgia Institute of Technology (isbell@cc.gatech.edu)
Pseudocode | Yes | Algorithm 1: State Aware Imitation Learning (a minimal sketch of the update loop follows the table)
Open Source Code | No | The paper does not provide an explicit statement about open-sourcing the code for the methodology or a link to a code repository.
Open Datasets | Yes | "The second domain we use is a noisy variation of the bipedal walker domain found in OpenAI Gym [2]." (see the environment-loading sketch after the table)
Dataset Splits | No | The paper describes using a set of 100 episodes from an oracle, or a single successful crossing, as demonstrations, and then collecting unsupervised episodes. However, it does not specify explicit training/validation/test dataset splits with percentages or counts.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., CPU or GPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions software components such as neural networks and RMSprop, but it does not specify version numbers for any programming languages, libraries, or frameworks used (e.g., Python, TensorFlow, PyTorch).
Experiment Setup | Yes | "At each iteration, 20 unsupervised sample episodes are collected to estimate the SAIL gradient, using plain stochastic gradient descent with a learning rate of 0.1 for the temporal difference update and RMSprop with a learning rate of 0.01 for updating the policy. [...] To train the network in a purely supervised approach, we use RMSprop over 3000 epochs with a batch size of 128 frames and a learning rate of 10⁻⁵. [...] The ∇θ log dπθ-network is trained using RMSprop with a learning rate of 10⁻⁴, whereas the policy network is trained using RMSprop and a learning rate of 10⁻⁶, starting after the first 1000 episodes." (these settings are collected in the configuration sketch below)
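
The Pseudocode row points to the paper's Algorithm 1. As a reading aid only, here is a minimal tabular sketch of that update loop as described in the paper: an on-policy temporal-difference estimate of ∇θ log dπθ(s) is combined with the supervised gradient ∇θ log πθ(a|s) on demonstration pairs. The tiny random MDP, the demonstration list, and the names used here (policy, grad_log_pi, rollout, g) are illustrative assumptions rather than the authors' code, and plain gradient ascent stands in for the RMSprop policy update reported in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, H = 5, 3, 40                        # states, actions, episode length (arbitrary toy sizes)

# Random tabular MDP: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(nS), size=(nS, nA))

def policy(theta):
    """Softmax policy pi(a|s) from per-state logits theta[s, a]."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def grad_log_pi(theta, s, a):
    """Gradient of log pi(a|s) w.r.t. theta; nonzero only in row s."""
    g = np.zeros_like(theta)
    g[s] = -policy(theta)[s]
    g[s, a] += 1.0
    return g

def rollout(theta, episodes=20):
    """Collect on-policy transitions (s, a, s'); the paper reports 20 episodes per iteration."""
    pi, transitions = policy(theta), []
    for _ in range(episodes):
        s = rng.integers(nS)
        for _ in range(H):
            a = rng.choice(nA, p=pi[s])
            s_next = rng.choice(nS, p=P[s, a])
            transitions.append((s, a, s_next))
            s = s_next
    return transitions

# Placeholder demonstrations standing in for the oracle state-action pairs.
demos = [(0, 0), (1, 1), (2, 2), (3, 0)]

theta = np.zeros((nS, nA))
g = np.zeros((nS, nS, nA))                  # g[s] estimates grad_theta log d_pi(s)
alpha_td, alpha_pi = 0.1, 0.01              # learning rates reported for the tabular experiment

for iteration in range(200):
    # 1) Temporal-difference estimate of the gradient of the log state distribution:
    #    g(s') <- g(s') + alpha_td * (grad log pi(a|s) + g(s) - g(s'))
    for s, a, s_next in rollout(theta):
        g[s_next] += alpha_td * (grad_log_pi(theta, s, a) + g[s] - g[s_next])
    # 2) Ascend the joint demonstration likelihood: grad log pi(a|s) + grad log d_pi(s).
    step = sum(grad_log_pi(theta, s, a) + g[s] for s, a in demos)
    theta += alpha_pi * step / len(demos)
```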
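
The Open Datasets row refers to a noisy variant of the Gym bipedal walker. For orientation, the snippet below loads today's environment id ("BipedalWalker-v3" via the maintained gymnasium package; the 2017 paper would have used an earlier gym release) and adds Gaussian observation noise as a stand-in, since the paper's exact noise model is not quoted above.

```python
import numpy as np
import gymnasium as gym   # maintained successor of the original OpenAI Gym; needs gymnasium[box2d]

class NoisyObservation(gym.ObservationWrapper):
    """Adds Gaussian noise to observations; the std below is a placeholder, not the paper's value."""
    def __init__(self, env, std=0.1):
        super().__init__(env)
        self.std = std

    def observation(self, obs):
        return obs + np.random.normal(0.0, self.std, size=obs.shape)

env = NoisyObservation(gym.make("BipedalWalker-v3"))
obs, info = env.reset(seed=0)
```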
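
Finally, the Experiment Setup row quotes several optimizer settings. The configuration sketch below collects them in one place; the PyTorch optimizers and the single-layer placeholder networks are my assumptions, and only the learning rates, batch size, epoch count, and episode counts come from the quoted text.

```python
import torch

# Tabular domain (section 4.1): SGD lr 0.1 for the TD update, RMSprop lr 0.01 for the policy,
# with 20 unsupervised episodes collected per iteration.
TABULAR = dict(episodes_per_iter=20, td_lr=0.1, policy_lr=0.01)

# Bipedal walker (section 4.2): supervised pre-training and SAIL settings as quoted.
SUPERVISED = dict(optimizer="RMSprop", epochs=3000, batch_size=128, lr=1e-5)
SAIL = dict(grad_log_d_lr=1e-4, policy_lr=1e-6, policy_updates_start_after_episodes=1000)

# Placeholder architectures (the paper's network shapes are not quoted above).
obs_dim, act_dim = 24, 4                                    # BipedalWalker observation/action sizes
policy_net = torch.nn.Linear(obs_dim, act_dim)
n_policy_params = sum(p.numel() for p in policy_net.parameters())
grad_log_d_net = torch.nn.Linear(obs_dim, n_policy_params)  # predicts grad_theta log d_pi(s) per state

opt_grad_log_d = torch.optim.RMSprop(grad_log_d_net.parameters(), lr=SAIL["grad_log_d_lr"])
opt_policy = torch.optim.RMSprop(policy_net.parameters(), lr=SAIL["policy_lr"])
```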