Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Data augmentation for efficient learning from parametric experts

Authors: Alexandre Galashov, Josh S. Merel, Nicolas Heess

NeurIPS 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate the benefit of our method in the context of several existing and widely used algorithms that include policy cloning as a constituent part. Moreover, we highlight the benefits of our approach in two practically relevant settings (a) expert compression, i.e. transfer to a student with fewer parameters; and (b) transfer from privileged experts, i.e. where the expert has a different observation space than the student, usually including access to privileged information. To study how our method performs on complex control domains, we consider three complex, high-DoF continuous control tasks: Humanoid Run, Humanoid Walls and Insert Peg.
Researcher Affiliation Industry Alexandre Galashov DeepMind EMAIL Josh Merel DeepMind EMAIL Nicolas Heess DeepMind EMAIL
Pseudocode Yes We illustrate it in Figure 1 and we formulate APC algorithm for BC in Algorithm 1. Algorithm 1 Augmented Policy Cloning (APC)
Open Source Code No Did you include the license to the code and datasets? [No] The code and the data are proprietary.
Open Datasets Yes To study how our method performs on complex control domains, we consider three complex, high-DoF continuous control tasks: Humanoid Run, Humanoid Walls and Insert Peg. All these domains are implemented using the MuJoCo physics engine [Todorov et al., 2012] and are available in the dm_control repository [Tunyasuvunakool et al., 2020].
Dataset Splits Yes We apply early stopping and select hyperparameters based on the evaluation performance on a validation set.
Hardware Specification No Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [TODO]
Software Dependencies No The paper mentions software like the MuJoCo physics engine and algorithms like MPO and VMPO, but it does not provide specific version numbers for these or any other software dependencies.
Experiment Setup Yes In the subsequent BC experiments, we use σ_E = 0.2. Moreover, in order to analyze the noise robustness of the student policy trained via BC, π(·|s) = N(µ(s), σ(s)), we evaluate it by executing an action drawn from a Gaussian with a fixed variance, i.e. a ~ N(µ(s), σ), where σ is the fixed amount of student noise. In all the experiments we use σ = 0.2.
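The fixed-noise evaluation quoted above can be sketched as follows. This is a minimal illustration, not the authors' code (which is proprietary); the function name and the NumPy-based setup are assumptions, and only the fixed student noise σ = 0.2 comes from the paper.

```python
import numpy as np

def sample_eval_action(mu, sigma=0.2, rng=None):
    """Draw an evaluation action a ~ N(mu, sigma^2 I).

    mu    -- mean action from the student policy at the current state
    sigma -- fixed student noise (0.2 in the paper's experiments)
    """
    rng = np.random.default_rng() if rng is None else rng
    mu = np.asarray(mu, dtype=float)
    # Ignore the policy's learned state-dependent variance sigma(s) and
    # inject a fixed amount of Gaussian noise instead.
    return mu + sigma * rng.standard_normal(mu.shape)
```

At evaluation time the learned variance head of the Gaussian policy is bypassed, so robustness is measured under the same noise level for every state.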