Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations
Authors: Haoran Xu, Xianyuan Zhan, Honglei Yin, Huiling Qin
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that our proposed algorithm achieves higher returns and faster training speed compared to baseline algorithms. |
| Researcher Affiliation | Collaboration | 1JD Technology, Beijing, China 2Institute for AI Industry Research (AIR), Tsinghua University, Beijing, China. |
| Pseudocode | Yes | In this section, we present the pseudocode of DWBC in Algorithm 1. Algorithm 1: Discriminator-Weighted Behavior Cloning (DWBC). |
| Open Source Code | Yes | Code is available at https://github.com/ryanxhr/DWBC. |
| Open Datasets | Yes | We construct experiments on both widely-used D4RL MuJoCo datasets (Fu et al., 2020) and more complex Adroit human datasets (Rajeswaran et al., 2017). |
| Dataset Splits | Yes | In Setting 1, we use mixed datasets in MuJoCo environments. We sort all trajectories from high to low by the total reward summed over the entire trajectory, and define a trajectory as well-performing if it is among the top 20% of all trajectories. We then sample every Xth well-performing trajectory to constitute De and use the remaining trajectories in the dataset to constitute Do (illustrated in the split sketch below the table). |
| Hardware Specification | Yes | In this paper, all experiments are implemented with TensorFlow and executed on NVIDIA V100 GPUs. |
| Software Dependencies | No | In this paper, all experiments are implemented with TensorFlow and executed on NVIDIA V100 GPUs. (TensorFlow is mentioned, but without a specific version number.) |
| Experiment Setup | Yes | For all function approximators, we use fully connected neural networks with ReLU activations. For policy networks, we use tanh (Gaussian) outputs. We use Adam for all optimizers. The batch size is 256 and γ is 0.99. ... The policy network is a 3-layer MLP with 256 hidden units in each layer. ... The learning rate for the policy is 1e-5 and the learning rate for the discriminator network is 1e-4. We search α in {1, 2, 5, 10} for best model performance. We clip the output of d to [0.1, 0.9]. We set η to 0.5 across all tasks... (A sketch of a training step with these settings appears below the table.) |
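
To make the Setting 1 split concrete, here is a minimal Python sketch of the procedure quoted in the Dataset Splits row. The function name `split_dataset` and the sampling interval `every_x` (standing in for the paper's unspecified "every Xth trajectory") are illustrative assumptions, not names from the released code.

```python
import numpy as np

def split_dataset(trajectories, returns, top_frac=0.2, every_x=10):
    """Split a mixed dataset into expert data De and supplementary data Do,
    following the Setting 1 description. `every_x` stands in for the
    paper's unspecified "every Xth trajectory" interval."""
    # Sort trajectory indices from high to low total reward.
    order = np.argsort(returns)[::-1]
    # The top 20% of trajectories count as well-performing.
    n_top = max(1, int(len(order) * top_frac))
    well_performing = order[:n_top]
    # Every Xth well-performing trajectory goes into De ...
    expert_ids = {int(i) for i in well_performing[::every_x]}
    De = [trajectories[i] for i in sorted(expert_ids)]
    # ... and everything else in the dataset constitutes Do.
    Do = [t for i, t in enumerate(trajectories) if i not in expert_ids]
    return De, Do
```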
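The Experiment Setup row also maps naturally onto a short TensorFlow training step. The sketch below is a simplified illustration of the discriminator-weighted BC idea under the reported hyperparameters (Adam, 3-layer 256-unit MLPs, learning rates 1e-5/1e-4, α, discriminator output clipped to [0.1, 0.9]); it is not the paper's exact Algorithm 1 objective, and the helper names (`make_mlp`, `weighted_bc_step`), plus the assumption that the policy returns a distribution object with a `log_prob` method, are ours. The reported η = 0.5 enters the discriminator's own training objective, which this sketch leaves out.

```python
import tensorflow as tf

ALPHA = 5.0  # α, searched in {1, 2, 5, 10} per the reported setup

def make_mlp(out_dim):
    # 3-layer MLP with 256 hidden units and ReLU activations, as reported.
    return tf.keras.Sequential(
        [tf.keras.layers.Dense(256, activation="relu") for _ in range(3)]
        + [tf.keras.layers.Dense(out_dim)]
    )

policy_opt = tf.keras.optimizers.Adam(1e-5)  # reported policy lr
disc_opt = tf.keras.optimizers.Adam(1e-4)    # reported discriminator lr

def weighted_bc_step(policy, disc, batch_e, batch_o):
    """One simplified discriminator-weighted BC update.
    `policy(s)` is assumed to return a tanh-Gaussian distribution
    (e.g. via TensorFlow Probability) exposing `log_prob`;
    `disc` is assumed to output P(expert) in (0, 1)."""
    (s_e, a_e), (s_o, a_o) = batch_e, batch_o
    with tf.GradientTape() as tape:
        logp_e = policy(s_e).log_prob(a_e)
        logp_o = policy(s_o).log_prob(a_o)
        # Clip discriminator outputs to [0.1, 0.9] as reported.
        d_o = tf.clip_by_value(disc(tf.concat([s_o, a_o], axis=-1)), 0.1, 0.9)
        # BC on the expert set De, scaled by α, plus BC on Do
        # reweighted by the (frozen) discriminator output.
        loss = -ALPHA * tf.reduce_mean(logp_e) \
               - tf.reduce_mean(tf.stop_gradient(d_o) * logp_o)
    grads = tape.gradient(loss, policy.trainable_variables)
    policy_opt.apply_gradients(zip(grads, policy.trainable_variables))
    return loss
```

The exact DWBC losses, including how the discriminator is trained and where η = 0.5 and the 1e-4 discriminator learning rate enter, are given in the authors' repository linked in the Open Source Code row.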