Imitation Learning by Reinforcement Learning

Authors: Kamil Ciosek

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments which confirm that our reduction works well in practice for continuous control tasks. ... In Section 6, we empirically evaluate the performance of the reduction as the amount of available expert data varies.
Researcher Affiliation | Industry | Kamil Ciosek, Spotify, kamilc@spotify.com
Pseudocode | Yes | Algorithm 1 Imitation Learning by Reinforcement Learning (ILR). Require: expert dataset D, ENVIRONMENT (without access to extrinsic reward). R_int(s, a) ← 1{(s, a) ∈ D}; π_I ← RL-SOLVER(ENVIRONMENT, R_int). (A Python sketch of this reduction follows the table.)
Open Source Code | Yes | Our main contribution is Proposition 1, for which we provided a complete proof. The external results we rely on are generic properties of the Total Variation distance and of Markov chain mixing, for all of which we have provided references. Our experimental setup is described in detail in Appendix A. Moreover, we make the source code available.
Open Datasets | Yes | We use the Hopper, Ant, Walker and HalfCheetah continuous control environments from the PyBullet gym suite (Ellenberger, 2018-2019). The expert policy is obtained by training a SAC agent for 2 million steps.
Dataset Splits | No | The paper discusses an 'expert dataset' with a varying 'amount of data' measured in 'episodes', but it does not specify explicit training, validation, and test splits (e.g., an 80/10/10 split) or per-split sample counts.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions software such as 'SAC' and the 'PyBullet gym suite' (Ellenberger, 2018-2019), and lists optimizers such as 'Adam' among the hyperparameters. However, it does not specify version numbers for Python, PyTorch, TensorFlow, or the PyBullet library itself, which are needed for reproducible software dependencies.
Experiment Setup | Yes | The hyperparameters of SAC are given in the table below: actor optimizer: Adam; actor learning rate: 3e-4; critic optimizer: Adam; critic learning rate: 3e-4; batch size: 100; update rate for target network (tau): 0.005; γ: 0.99. The policy network and the critic network have two fully connected middle layers with 256 neurons each, followed by ReLU non-linearities. (A configuration sketch follows the table.)
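
A minimal sketch of the reduction quoted in the Pseudocode row, assuming a generic `rl_solver` callable; the function names `intrinsic_reward` and `imitation_learning_by_rl` are illustrative and not taken from the paper's released code, and exact set membership is only a stand-in for the indicator 1{(s, a) ∈ D}.

```python
def intrinsic_reward(expert_dataset, state, action):
    """R_int(s, a) = 1 if the pair (s, a) appears in the expert dataset, else 0."""
    return 1.0 if (state, action) in expert_dataset else 0.0


def imitation_learning_by_rl(expert_dataset, environment, rl_solver):
    """ILR: reduce imitation learning to RL on the intrinsic reward R_int.

    No extrinsic reward from the environment is used; any RL solver
    (SAC in the paper's experiments) that maximizes R_int returns the
    imitating policy pi_I.
    """
    def r_int(state, action):
        return intrinsic_reward(expert_dataset, state, action)

    return rl_solver(environment, r_int)
```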
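
A sketch of the SAC settings quoted in the Experiment Setup row, written as a plain configuration dict plus the 256-256 ReLU trunk used by both the policy and critic networks; it assumes PyTorch, and the dict keys and the helper name `mlp` are illustrative.

```python
import torch.nn as nn

# Hyperparameters quoted in the Experiment Setup row.
sac_config = {
    "actor_optimizer": "Adam",
    "actor_learning_rate": 3e-4,
    "critic_optimizer": "Adam",
    "critic_learning_rate": 3e-4,
    "batch_size": 100,
    "target_update_tau": 0.005,
    "gamma": 0.99,
}


def mlp(in_dim, out_dim, hidden=256):
    """Two fully connected hidden layers of 256 units with ReLU non-linearities."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )
```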