Behavioral Cloning from Noisy Demonstrations

Authors: Fumihiro Sasaki, Ryota Yamashina

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, we aim to answer the following three questions: Q1. Does our algorithm improve the learner performance more than BC given the noisy demonstrations? Q2. Can the compounding error due to ζ be reduced as the number of noisy demonstrations increases? Q3. Is our algorithm competitive to the existing IL methods if both annotations associated with the non-optimality and environment interactions are allowed? To answer Q1 and Q2, we evaluated our algorithm against BC on four continuous control tasks that are simulated with the MuJoCo physics simulator (Todorov et al., 2012).
Researcher Affiliation | Industry | Fumihiro Sasaki & Ryota Yamashina, Ricoh Company, Ltd. {fumihiro.fs.sasaki,ryohta.yamashina}@jp.ricoh.com
Pseudocode | Yes | Algorithm 1 Behavioral Cloning from Noisy Demonstrations
  1: Given the expert demonstrations D.
  2: Set R̂(s, a) = 1 for (s, a) ∈ D.
  3: Split D into K disjoint sets {D_1, D_2, ..., D_K}.
  4: for iteration = 1, M do
  5:   for k = 1, K do
  6:     Initialize parameters θ_k.
  7:     for l = 1, L do
  8:       Sample a random minibatch of N state-action pairs (s_n, a_n) from D_k.
  9:       Calculate a sampled gradient (1/N) Σ_{n=1}^{N} ∇_{θ_k} log π_{θ_k}(a_n|s_n) R̂(s_n, a_n).
 10:       Update θ_k by gradient ascent using the sampled gradient.
 11:     end for
 12:   end for
 13:   Copy π_{θ_old} ← π_θ.
 14:   Set R̂(s, a) = π_{θ_old}(a|s) for (s, a) ∈ D.
 15: end for
 16: return π_θ.
(A Python sketch of this training loop follows the table.)
Open Source Code | No | The paper provides links to publicly available code for *baselines* (IC-GAIL, 2IWIL, T-REX, GAIL, DRIL) but does not provide a link or explicit statement for the code of their *own proposed method*.
Open Datasets | Yes | We train an agent on each task by proximal policy optimization (PPO) algorithm (Schulman et al., 2017) using the rewards defined in the OpenAI Gym (Brockman et al., 2016). (A sketch of this demonstration-generation step follows the table.)
Dataset Splits | No | The paper does not explicitly describe train/validation/test dataset splits for its own collected noisy demonstrations. It mentions using N state-action pairs for training and evaluating policies based on cumulative rewards in an episode, but not data partitioning for validation purposes.
Hardware Specification | No | The paper mentions using the 'MuJoCo physics simulator' for continuous control tasks but does not specify any hardware details such as CPU or GPU models, or cloud computing resources used for running simulations or training models.
Software Dependencies | No | The paper mentions software components like the 'proximal policy optimization (PPO) algorithm', 'OpenAI Gym', the 'Adam' optimizer, and the 'MuJoCo physics simulator', but it does not provide specific version numbers for these software dependencies (e.g., Python 3.x, MuJoCo x.y.z, TensorFlow/PyTorch x.y.z).
Experiment Setup | Yes | We implement our algorithm using K neural networks with two hidden layers to represent policies π_{θ_1}, π_{θ_2}, ..., π_{θ_K} in the ensemble. The input of the networks is vector representations of the state. Each neural network has 100 hidden units in each hidden layer followed by hyperbolic tangent nonlinearity... We employ Adam (Kingma & Ba, 2014) for learning parameters with a learning rate of η = 10⁻⁴... The parameters in all layers are initialized by Xavier initialization (Glorot & Bengio, 2010). The mini-batch size and the number of training epochs are 128 and 500, respectively. (A sketch of this setup follows the table.)
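
To make the Pseudocode row concrete, below is a minimal Python sketch of the training loop in Algorithm 1. The choice of PyTorch, the `GaussianPolicy` parameterization, and the use of the mean likelihood of the K old policies to stand in for π_{θ_old}(a|s) are assumptions made for illustration, not details taken from the paper.

```python
# Minimal sketch of Algorithm 1: weighted behavioral cloning with iterative
# reweighting of demonstrations. PyTorch, the Gaussian policy class, and the
# ensemble-averaged reweighting are assumptions made for illustration.
import torch
import torch.nn as nn


class GaussianPolicy(nn.Module):
    """Two-hidden-layer tanh policy with a state-independent log-std (assumed)."""

    def __init__(self, state_dim, action_dim, hidden=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, action_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def log_prob(self, states, actions):
        mean = self.net(states)
        dist = torch.distributions.Normal(mean, self.log_std.exp())
        return dist.log_prob(actions).sum(-1)


def bc_from_noisy_demos(states, actions, K=5, M=3, L=500, N=128, lr=1e-4):
    """states, actions: tensors holding every demonstration state-action pair."""
    num_pairs = states.shape[0]
    r_hat = torch.ones(num_pairs)                 # step 2: R̂(s, a) = 1 for all pairs
    splits = torch.randperm(num_pairs).chunk(K)   # step 3: K disjoint subsets
    policies = []
    for _ in range(M):                            # step 4: outer iterations
        # steps 5-6: one freshly initialized policy per subset
        policies = [GaussianPolicy(states.shape[1], actions.shape[1]) for _ in range(K)]
        for k, idx in enumerate(splits):
            opt = torch.optim.Adam(policies[k].parameters(), lr=lr)
            for _ in range(L):                    # step 7: inner gradient steps
                mb = idx[torch.randint(len(idx), (N,))]   # step 8: minibatch from D_k
                # steps 9-10: ascend the R̂-weighted log-likelihood
                loss = -(policies[k].log_prob(states[mb], actions[mb]) * r_hat[mb]).mean()
                opt.zero_grad()
                loss.backward()
                opt.step()
        with torch.no_grad():                     # steps 13-14: reweight demonstrations
            # Averaging the K old policies' likelihoods to stand in for
            # π_θ_old(a|s) is an assumption; the pseudocode does not specify it.
            likelihoods = torch.stack([p.log_prob(states, actions).exp() for p in policies])
            r_hat = likelihoods.mean(dim=0)
    return policies                               # step 16: the final ensemble
```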
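
The Open Datasets row notes that the demonstrators were trained with PPO on OpenAI Gym MuJoCo tasks. The paper does not name a PPO implementation, so the sketch below assumes Stable-Baselines3, the Hopper-v3 task id, and the classic Gym step API purely for illustration.

```python
# Illustrative sketch of generating demonstrations from a PPO-trained agent.
# Stable-Baselines3, the Hopper-v3 task id, and the rollout details are
# assumptions; the paper only states that PPO was trained on Gym rewards.
import gym
from stable_baselines3 import PPO

env = gym.make("Hopper-v3")               # one of the MuJoCo continuous control tasks
agent = PPO("MlpPolicy", env, verbose=0)
agent.learn(total_timesteps=1_000_000)    # train with the rewards defined in Gym

# Roll out the trained agent and record state-action pairs as a demonstration.
demo = []
obs = env.reset()
done = False
while not done:
    action, _ = agent.predict(obs, deterministic=True)
    demo.append((obs, action))
    obs, reward, done, info = env.step(action)
```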
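
For the Experiment Setup row, the sketch below wires the stated hyperparameters together: two hidden layers of 100 tanh units, Xavier initialization, Adam with a learning rate of 10⁻⁴, mini-batch size 128, and 500 training epochs. PyTorch, the `xavier_uniform_` variant, and the Hopper-sized input/output dimensions are assumptions beyond what the paper states.

```python
# Hedged sketch of the reported training setup; the framework and the uniform
# (vs. normal) Xavier variant are assumptions beyond what the paper states.
import torch
import torch.nn as nn


def make_policy_net(state_dim, action_dim, hidden=100):
    net = nn.Sequential(
        nn.Linear(state_dim, hidden), nn.Tanh(),     # 100 tanh units per hidden layer
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, action_dim),
    )
    for layer in net:
        if isinstance(layer, nn.Linear):
            nn.init.xavier_uniform_(layer.weight)    # Xavier initialization (Glorot & Bengio, 2010)
            nn.init.zeros_(layer.bias)
    return net


net = make_policy_net(state_dim=11, action_dim=3)    # Hopper-sized dimensions, for example
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)
batch_size, num_epochs = 128, 500
```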