BAIL: Best-Action Imitation Learning for Batch Deep Reinforcement Learning
Authors: Xinyue Chen, Zijian Zhou, Zheng Wang, Che Wang, Yanqiu Wu, Keith Ross
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For the MuJoCo benchmark, we provide a comprehensive experimental study of BAIL, comparing its performance to four other batch Q-learning and imitation-learning schemes for a large variety of batch datasets. Our experiments show that BAIL's performance is much higher than the other schemes, and is also computationally much faster than the batch Q-learning schemes. |
| Researcher Affiliation | Academia | New York University Shanghai; New York University |
| Pseudocode | Yes | We provide detailed pseudo-code for both BAIL and Progressive BAIL in the supplementary materials. |
| Open Source Code | Yes | We provide public open source code for reproducibility: https://github.com/lanyavik/BAIL |
| Open Datasets | Yes | Because the BCQ and BEAR codes are publicly available, we are able to make a careful and comprehensive comparison of the performance of BAIL, BCQ, BEAR, MARWIL and vanilla Behavior Cloning (BC) using the MuJoCo benchmark. We will also make our datasets publicly available for future benchmarking. |
| Dataset Splits | Yes | We split the data into a training set and validation set. |
| Hardware Specification | No | The paper mentions running experiments "on a CPU node" but does not provide specific details such as CPU model, number of cores, GPU models, or memory specifications. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies or libraries used (e.g., Python version, PyTorch/TensorFlow version, specific Reinforcement Learning frameworks). |
| Experiment Setup | Yes | For a fair comparison, we keep all hyper-parameters fixed for all experiments, instead of fine-tuning for each one. In practice, we find K = 1000 works well for all environments tested. In this paper we use p = 25% for all environments and batches. (A hedged sketch of how these two hyperparameters are used appears below the table.) |
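
The two hyperparameters quoted above map onto BAIL's two core steps as described in the paper: K = 1000 is the penalty constant in the upper-envelope regression loss, and p = 25% is the fraction of best (state, action) pairs kept for behavior cloning. The sketch below illustrates both steps under stated assumptions; it is not the released code (https://github.com/lanyavik/BAIL). The `UpperEnvelope` and `select_best_actions` names, the network width, and the positivity clamp on V(s) are illustrative choices, and the Monte Carlo returns G_i are assumed to be precomputed from the batch.

```python
import numpy as np
import torch
import torch.nn as nn

class UpperEnvelope(nn.Module):
    """Small MLP regressor V(s) for the upper envelope of Monte Carlo returns.
    (Architecture is illustrative, not the paper's exact network.)"""
    def __init__(self, state_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s):
        return self.net(s).squeeze(-1)

def envelope_loss(v_pred, returns, K=1000.0):
    """Squared error where underestimation (V < G) is penalized K times more,
    pushing V(s) toward an upper envelope of the observed returns."""
    err = v_pred - returns
    weight = torch.where(err < 0, torch.full_like(err, K), torch.ones_like(err))
    return (weight * err ** 2).mean()

def select_best_actions(states, actions, returns, v_model, p=0.25):
    """Keep the top-p fraction of (s, a) pairs ranked by the ratio G / V(s),
    i.e. pairs with G >= x * V(s) for a data-dependent threshold x."""
    with torch.no_grad():
        v = v_model(torch.as_tensor(states, dtype=torch.float32)).numpy()
    # Clamping V(s) away from zero is a simplification for this sketch;
    # MuJoCo returns are typically positive.
    ratios = returns / np.maximum(v, 1e-8)
    x = np.quantile(ratios, 1.0 - p)  # choose x so roughly a fraction p survives
    keep = ratios >= x
    return states[keep], actions[keep]
```

Behavior cloning on the surviving pairs (ordinary supervised regression of actions on states) then yields the final policy. Per the Dataset Splits row, the paper holds out a validation set when fitting the envelope, which serves for early stopping of that regression.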