Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations
Authors: Haoran Xu, Xianyuan Zhan, Honglei Yin, Huiling Qin
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our proposed algorithm achieves higher returns and faster training speed compared to baseline algorithms. |
| Researcher Affiliation | Collaboration | 1JD Technology, Beijing, China 2Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China. |
| Pseudocode | Yes | In this section, we present the pseudocode of DWBC in Algorithm 1. Algorithm 1: Discriminator-Weighted Behavior Cloning (DWBC). |
| Open Source Code | Yes | Code is available at https://github.com/ryanxhr/DWBC. |
| Open Datasets | Yes | We construct experiments on both widely-used D4RL MuJoCo datasets (Fu et al., 2020) and more complex Adroit human datasets (Rajeswaran et al., 2017). |
| Dataset Splits | Yes | In Setting 1, we use mixed datasets in MuJoCo environments. We sort all trajectories from high to low by the total reward summed over the entire trajectory, and define a trajectory as well-performing if it is among the top 20% of all trajectories. We then sample every Xth well-performing trajectory to constitute De and use the remaining trajectories in the dataset to constitute Do (illustrated in the split sketch below the table). |
| Hardware Specification | Yes | In this paper, all experiments are implemented with TensorFlow and executed on NVIDIA V100 GPUs. |
| Software Dependencies | No | In this paper, all experiments are implemented with TensorFlow and executed on NVIDIA V100 GPUs. (TensorFlow is mentioned, but without a specific version number.) |
| Experiment Setup | Yes | For all function approximators, we use fully connected neural networks with ReLU activations. For policy networks, we use tanh (Gaussian) outputs. We use Adam for all optimizers. The batch size is 256 and γ is 0.99. ... The policy network is a 3-layer MLP with 256 hidden units in each layer. ... The learning rate for the policy is 1e-5 and the learning rate for the discriminator network is 1e-4. We search α in {1, 2, 5, 10} for best model performance. We clip the output of d to [0.1, 0.9]. We set η to 0.5 across all tasks... (A sketch of a training step with these settings appears below the table.) |
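
To make the Setting 1 split concrete, here is a minimal Python sketch of the procedure quoted in the Dataset Splits row. The function name `split_dataset` and the sampling interval `every_x` (standing in for the paper's unspecified "every Xth trajectory") are illustrative assumptions, not names from the released code.

```python
import numpy as np

def split_dataset(trajectories, returns, top_frac=0.2, every_x=10):
    """Split a mixed dataset into expert data De and supplementary data Do,
    following the Setting 1 description. `every_x` stands in for the
    paper's unspecified "every Xth trajectory" interval."""
    # Sort trajectory indices from high to low total reward.
    order = np.argsort(returns)[::-1]
    # The top 20% of trajectories count as well-performing.
    n_top = max(1, int(len(order) * top_frac))
    well_performing = order[:n_top]
    # Every Xth well-performing trajectory goes into De ...
    expert_ids = {int(i) for i in well_performing[::every_x]}
    De = [trajectories[i] for i in sorted(expert_ids)]
    # ... and everything else in the dataset constitutes Do.
    Do = [t for i, t in enumerate(trajectories) if i not in expert_ids]
    return De, Do
```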
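The Experiment Setup row also maps naturally onto a short TensorFlow training step. The sketch below is a simplified illustration of the discriminator-weighted BC idea under the reported hyperparameters (Adam, 3-layer 256-unit MLPs, learning rates 1e-5/1e-4, α, discriminator output clipped to [0.1, 0.9]); it is not the paper's exact Algorithm 1 objective, and the helper names (`make_mlp`, `weighted_bc_step`), plus the assumption that the policy returns a distribution object with a `log_prob` method, are ours. The reported η = 0.5 enters the discriminator's own training objective, which this sketch leaves out.

```python
import tensorflow as tf

ALPHA = 5.0  # α, searched in {1, 2, 5, 10} per the reported setup

def make_mlp(out_dim):
    # 3-layer MLP with 256 hidden units and ReLU activations, as reported.
    return tf.keras.Sequential(
        [tf.keras.layers.Dense(256, activation="relu") for _ in range(3)]
        + [tf.keras.layers.Dense(out_dim)]
    )

policy_opt = tf.keras.optimizers.Adam(1e-5)  # reported policy lr
disc_opt = tf.keras.optimizers.Adam(1e-4)    # reported discriminator lr

def weighted_bc_step(policy, disc, batch_e, batch_o):
    """One simplified discriminator-weighted BC update.
    `policy(s)` is assumed to return a tanh-Gaussian distribution
    (e.g. via TensorFlow Probability) exposing `log_prob`;
    `disc` is assumed to output P(expert) in (0, 1)."""
    (s_e, a_e), (s_o, a_o) = batch_e, batch_o
    with tf.GradientTape() as tape:
        logp_e = policy(s_e).log_prob(a_e)
        logp_o = policy(s_o).log_prob(a_o)
        # Clip discriminator outputs to [0.1, 0.9] as reported.
        d_o = tf.clip_by_value(disc(tf.concat([s_o, a_o], axis=-1)), 0.1, 0.9)
        # BC on the expert set De, scaled by α, plus BC on Do
        # reweighted by the (frozen) discriminator output.
        loss = -ALPHA * tf.reduce_mean(logp_e) \
               - tf.reduce_mean(tf.stop_gradient(d_o) * logp_o)
    grads = tape.gradient(loss, policy.trainable_variables)
    policy_opt.apply_gradients(zip(grads, policy.trainable_variables))
    return loss
```

The exact DWBC losses, including how the discriminator is trained and where η = 0.5 and the 1e-4 discriminator learning rate enter, are given in the authors' repository linked in the Open Source Code row.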