Deterministic and Discriminative Imitation (D2-Imitation): Revisiting Adversarial Imitation for Sample Efficiency

Authors: Mingfei Sun, Sam Devlin, Katja Hofmann, Shimon Whiteson

AAAI 2022 (pp. 8378-8385) | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our empirical results show that D2-Imitation is effective in achieving good sample efficiency, outperforming several off-policy extensions of adversarial imitation on many control tasks.
Researcher Affiliation | Collaboration | Mingfei Sun (1, 2), Sam Devlin (2), Katja Hofmann (2), Shimon Whiteson (1); 1 University of Oxford, 2 Microsoft Research; {mingfei.sun, shimon.whiteson}@cs.ox.ac.uk, {sam.devlin, katja.hofmann}@microsoft.com
Pseudocode | Yes | Algorithm 1: D2-Imitation
Open Source Code | Yes | Code repository: https://github.com/mingfeisun/d2-imitation.
Open Datasets | Yes | We evaluate D2-Imitation on several popular benchmarks for continuous control. We first use four physics-based control tasks: Swimmer, Hopper, Walker2d and HalfCheetah, ranging from low-dimensional control tasks to difficult high-dimensional ones. Each task comes with a true cost function (Brockman et al. 2016).
Dataset Splits | No | The paper describes collecting expert demonstrations and their use in training, but does not specify formal train/validation/test dataset splits with percentages or sample counts.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., CPU/GPU models, memory), beyond a general acknowledgement of a "generous equipment grant from NVIDIA".
Software Dependencies | No | The paper mentions several algorithms and optimizers (e.g., SAC, PPO, DDPG, TD3, Adam) and specific network architectures, but does not provide version numbers for programming languages or core software libraries such as PyTorch or TensorFlow used in the implementation.
Experiment Setup | Yes | For the critic and actor networks, we use a two-layer MLP with ReLU activations (tanh activation for the last layer of the actor network) and two hidden layers of 256 units each (256+256). For the discriminator we use the same architecture as Ho and Ermon (2016): a two-layer MLP with 100 hidden units and tanh activations. These design choices have been empirically shown to be well suited to control tasks. We train all networks with Adam (Kingma and Ba 2014) with a learning rate of 10^-3. The discriminator is pretrained for 1000 iterations with batch size 256. We set the probability threshold heuristically to 0.8 (after a sweep from 0.7 to 0.95), as it empirically works well across all domains. In the positive buffer B+, we force the off-policy samples to account for only a small portion (~25%); the majority is still from demonstrations. The implementation of off-policy TD follows that of TD3 (Fujimoto, van Hoof, and Meger 2018), with a double Q-network. We perform evaluation using 10 random seeds. (A minimal sketch of this setup appears after the table.)
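
The experiment setup quoted above maps onto a small amount of code. The sketch below is a minimal PyTorch rendering of the reported architectures, optimizer settings, and the discriminator-threshold routing of samples into the positive buffer B+. It is not the authors' implementation (that is available at https://github.com/mingfeisun/d2-imitation); the observation/action dimensions, class names, and the `route_batch` helper are illustrative assumptions.

```python
# Minimal sketch of the reported D2-Imitation setup (assumptions noted inline).
import torch
import torch.nn as nn


class Actor(nn.Module):
    """Two-layer MLP (256+256), ReLU hidden activations, tanh on the output layer."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),
        )

    def forward(self, obs):
        return self.net(obs)


class Critic(nn.Module):
    """TD3-style double Q-network: two independent two-layer MLPs (256+256)."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        def q_net():
            return nn.Sequential(
                nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, 1),
            )
        self.q1, self.q2 = q_net(), q_net()

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)
        return self.q1(x), self.q2(x)


class Discriminator(nn.Module):
    """Two-layer MLP with 100 hidden units and tanh activations (Ho & Ermon 2016 style)."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 100), nn.Tanh(),
            nn.Linear(100, 100), nn.Tanh(),
            nn.Linear(100, 1),  # logit; sigmoid gives P(expert | s, a)
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))


# Placeholder dimensions (assumption), roughly Walker2d-sized.
obs_dim, act_dim = 17, 6
actor = Actor(obs_dim, act_dim)
critic = Critic(obs_dim, act_dim)
disc = Discriminator(obs_dim, act_dim)

# All networks are trained with Adam at a learning rate of 1e-3, as quoted above.
opt_actor = torch.optim.Adam(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.Adam(critic.parameters(), lr=1e-3)
opt_disc = torch.optim.Adam(disc.parameters(), lr=1e-3)

# Discriminator-based routing of off-policy samples toward the positive buffer B+:
# transitions whose predicted expert probability exceeds the 0.8 threshold are
# treated as positive (the exact buffer bookkeeping here is an assumption).
THRESHOLD = 0.8

def route_batch(obs, act):
    """Return a boolean mask selecting samples to add to the positive buffer B+."""
    with torch.no_grad():
        p_expert = torch.sigmoid(disc(obs, act)).squeeze(-1)
    return p_expert > THRESHOLD
```

Per the quote above, the positive buffer B+ is kept mostly demonstration transitions (off-policy samples limited to roughly 25%), the discriminator is pretrained for 1000 iterations with batch size 256, and the critic update follows TD3 with the double Q-network shown here.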