Behavioral Cloning from Noisy Demonstrations
Authors: Fumihiro Sasaki, Ryota Yamashina
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we aim to answer the following three questions: Q1. Does our algorithm improve the learner performance more than BC given the noisy demonstrations? Q2. Can the compounding error due to ζ be reduced as the number of noisy demonstrations increases? Q3. Is our algorithm competitive with the existing IL methods if both annotations associated with the non-optimality and environment interactions are allowed? To answer Q1 and Q2, we evaluated our algorithm against BC on four continuous control tasks simulated with the MuJoCo physics simulator (Todorov et al., 2012). |
| Researcher Affiliation | Industry | Fumihiro Sasaki & Ryota Yamashina Ricoh Company, Ltd. {fumihiro.fs.sasaki,ryohta.yamashina}@jp.ricoh.com |
| Pseudocode | Yes | Algorithm 1 Behavioral Cloning from Noisy Demonstrations. 1: Given the expert demonstrations D. 2: Set R̂(s, a) = 1 for (s, a) ∈ D. 3: Split D into K disjoint sets {D1, D2, ..., DK}. 4: for iteration = 1, M do 5: for k = 1, K do 6: Initialize parameters θk. 7: for l = 1, L do 8: Sample a random minibatch of N state-action pairs (s_n, a_n) from D_k. 9: Calculate the sampled gradient 1/N Σ_{n=1}^{N} ∇_{θk} log π_{θk}(a_n\|s_n) R̂(s_n, a_n). 10: Update θk by gradient ascent using the sampled gradient. 11: end for 12: end for 13: Copy π_{θold} ← π_θ. 14: Set R̂(s, a) = π_{θold}(a\|s) for (s, a) ∈ D. 15: end for 16: return π_θ. (A hedged Python sketch of this procedure appears below the table.) |
| Open Source Code | No | The paper provides links to publicly available code for *baselines* (IC-GAIL, 2IWIL, T-REX, GAIL, DRIL) but does not provide a link or explicit statement for the code of their *own proposed methodology*. |
| Open Datasets | Yes | We train an agent on each task by the proximal policy optimization (PPO) algorithm (Schulman et al., 2017) using the rewards defined in the OpenAI Gym (Brockman et al., 2016). |
| Dataset Splits | No | The paper does not explicitly describe train/validation/test dataset splits for its own collected noisy demonstrations. It mentions using N state-action pairs for training and evaluating policies based on cumulative rewards in an episode, but not data partitioning for validation purposes. |
| Hardware Specification | No | The paper mentions using the 'MuJoCo physics simulator' for continuous control tasks but does not specify any hardware details such as CPU or GPU models, or cloud computing resources used for running simulations or training models. |
| Software Dependencies | No | The paper mentions software components like the 'proximal policy optimization (PPO) algorithm', 'OpenAI Gym', the 'Adam' optimizer, and the 'MuJoCo physics simulator', but it does not provide specific version numbers for these software dependencies (e.g., Python 3.x, MuJoCo x.y.z, TensorFlow/PyTorch x.y.z). |
| Experiment Setup | Yes | We implement our algorithm using K neural networks with two hidden layers to represent policies πθ1, πθ2, ..., πθK in the ensemble. The input of the networks is vector representations of the state. Each neural network has 100 hidden units in each hidden layer followed by hyperbolic tangent nonlinearity... We employ Adam (Kingma & Ba, 2014) for learning parameters with a learning rate of η = 10⁻⁴... The parameters in all layers are initialized by Xavier initialization (Glorot & Bengio, 2010). The mini-batch size and the number of training epochs are 128 and 500, respectively. (A hedged PyTorch rendering of this configuration appears below the table.) |
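
The pseudocode row above can be read as a concrete training loop. The following is a minimal Python/PyTorch sketch of Algorithm 1 under several assumptions not stated in the excerpt: a discrete action space (the paper's MuJoCo tasks are continuous; a categorical policy keeps the sketch short), an ensemble probability taken as the mean over the K member policies, and helper names (`make_policy`, `bcnd`) that are ours, not the authors'.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_policy(state_dim, n_actions, hidden=100):
    # Two hidden layers of 100 tanh units, matching the experiment-setup row.
    return nn.Sequential(
        nn.Linear(state_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, n_actions),
    )

def bcnd(D, state_dim, n_actions, K=5, M=10, L=500, N=128, lr=1e-4):
    """D is a list of (state_tensor, action_index) pairs of noisy demonstrations."""
    states = torch.stack([s for s, _ in D])              # |D| x state_dim
    actions = torch.tensor([a for _, a in D])            # |D|
    R_hat = torch.ones(len(D))                           # step 2: R̂(s, a) = 1
    splits = torch.chunk(torch.randperm(len(D)), K)      # step 3: K disjoint sets
    policies = [make_policy(state_dim, n_actions) for _ in range(K)]

    for _ in range(M):                                    # outer iterations
        for k, idx in enumerate(splits):
            policies[k] = make_policy(state_dim, n_actions)  # step 6: re-initialize θk
            opt = torch.optim.Adam(policies[k].parameters(), lr=lr)
            for _ in range(L):                            # steps 7-11: weighted BC updates
                batch = idx[torch.randint(len(idx), (N,))]
                logp = F.log_softmax(policies[k](states[batch]), dim=-1)
                logp_a = logp.gather(1, actions[batch].unsqueeze(1)).squeeze(1)
                loss = -(logp_a * R_hat[batch]).mean()    # step 9: R̂-weighted log-likelihood
                opt.zero_grad(); loss.backward(); opt.step()
        # steps 13-14: freeze the current ensemble and reuse its action
        # probabilities as the reward R̂ for the next outer iteration.
        with torch.no_grad():
            probs = torch.stack(
                [F.softmax(p(states), dim=-1) for p in policies]
            ).mean(dim=0)
            R_hat = probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    return policies                                       # step 16: the policy ensemble
```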
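The experiment-setup row also translates almost line-for-line into framework code. Below is a minimal PyTorch rendering of that configuration; the choice of PyTorch, the placeholder state/action dimensions, and the Gaussian-mean interpretation of the output layer are our assumptions, since the paper's excerpt does not specify a framework or exact dimensions.

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 17, 6              # placeholder dimensions (assumed)
BATCH_SIZE, EPOCHS, LR = 128, 500, 1e-4    # mini-batch size, epochs, Adam learning rate

def xavier_init(module):
    # Xavier initialization (Glorot & Bengio, 2010) for every linear layer.
    if isinstance(module, nn.Linear):
        nn.init.xavier_uniform_(module.weight)
        nn.init.zeros_(module.bias)

# Two hidden layers of 100 units, each followed by a hyperbolic tangent.
policy = nn.Sequential(
    nn.Linear(STATE_DIM, 100), nn.Tanh(),
    nn.Linear(100, 100), nn.Tanh(),
    nn.Linear(100, ACTION_DIM),            # e.g. the mean of a Gaussian over continuous actions
)
policy.apply(xavier_init)
optimizer = torch.optim.Adam(policy.parameters(), lr=LR)
```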