Data augmentation for efficient learning from parametric experts

Authors: Alexandre Galashov, Josh S. Merel, Nicolas Heess

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the benefit of our method in the context of several existing and widely used algorithms that include policy cloning as a constituent part. Moreover, we highlight the benefits of our approach in two practically relevant settings: (a) expert compression, i.e. transfer to a student with fewer parameters; and (b) transfer from privileged experts, i.e. where the expert has a different observation space than the student, usually including access to privileged information. To study how our method performs on complex control domains, we consider three complex, high-DoF continuous control tasks: Humanoid Run, Humanoid Walls and Insert Peg.
Researcher Affiliation | Industry | Alexandre Galashov (DeepMind, agalashov@deepmind.com), Josh Merel (DeepMind, jsmerel@gmail.com), Nicolas Heess (DeepMind, heess@deepmind.com)
Pseudocode | Yes | We illustrate it in Figure 1 and we formulate the APC algorithm for BC in Algorithm 1 ("Algorithm 1: Augmented Policy Cloning (APC)").
Open Source Code | No | Did you include the license to the code and datasets? [No] The code and the data are proprietary.
Open Datasets | Yes | To study how our method performs on complex control domains, we consider three complex, high-DoF continuous control tasks: Humanoid Run, Humanoid Walls and Insert Peg. All these domains are implemented using the MuJoCo physics engine [Todorov et al., 2012] and are available in the dm_control repository [Tunyasuvunakool et al., 2020].
Dataset Splits | Yes | We apply early stopping and select hyperparameters based on the evaluation performance on a validation set.
Hardware Specification | No | Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [TODO]
Software Dependencies | No | The paper mentions software such as the MuJoCo physics engine and algorithms such as MPO and V-MPO, but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | In the subsequent BC experiments, we use σE = 0.2. Moreover, in order to analyze the noise robustness of the student policy trained via BC, π(·|s) = N(µ(s), σ(s)), we evaluate it by executing actions drawn from a Gaussian with a fixed variance, i.e. a ∼ N(µ(s), σ), where σ is the fixed amount of student noise. In all the experiments we use σ = 0.2.
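The fixed-noise evaluation described in the setup above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `student_mean` is a hypothetical placeholder for the learned mean network µ(s), and only the sampling rule a ∼ N(µ(s), σ) with fixed σ = 0.2 is taken from the quoted text.

```python
import numpy as np

def student_mean(state):
    """Placeholder for the learned mean mu(s) of the student policy.

    Illustrative only; the paper's student is a trained network.
    """
    return np.tanh(state)

def sample_eval_action(state, sigma=0.2, rng=None):
    """Draw an evaluation action a ~ N(mu(s), sigma) with fixed noise sigma,
    regardless of the state-dependent variance sigma(s) learned during BC."""
    rng = rng if rng is not None else np.random.default_rng()
    mu = student_mean(state)
    return mu + sigma * rng.standard_normal(np.shape(mu))
```

At evaluation time, the state-dependent variance σ(s) of the trained Gaussian policy is simply ignored and replaced by the fixed σ, which is what makes the robustness comparison across students well defined.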