TeachMyAgent: a Benchmark for Automatic Curriculum Learning in Deep RL

Authors: Clément Romac, Rémy Portelas, Katja Hofmann, Pierre-Yves Oudeyer

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we identify several key challenges faced by ACL algorithms. Based on these, we present TeachMyAgent (TA), a benchmark of current ACL algorithms leveraging procedural task generation. It includes 1) challenge-specific unit-tests using variants of a procedural Box2D bipedal walker environment, and 2) a new procedural Parkour environment combining most ACL challenges, making it ideal for global performance assessment. We then use TeachMyAgent to conduct a comparative study of representative existing approaches, showcasing the competitiveness of some ACL algorithms that do not use expert knowledge.
Researcher Affiliation | Collaboration | (1) Inria, France, (2) Microsoft Research, UK. Correspondence to: Clément Romac <clement.romac@inria.fr>, Rémy Portelas <remy.portelas@inria.fr>.
Pseudocode | No | The paper describes the ACL algorithms conceptually but does not include any pseudocode or algorithm blocks in the provided text.
Open Source Code | Yes | We open-source our environments, all studied ACL algorithms (collected from open-source code or re-implemented), and DRL students in a Python package available at https://github.com/flowersteam/TeachMyAgent.
Open Datasets | No | The paper describes procedurally generated environments (Stump Tracks and Parkour) that are part of the open-sourced benchmark. It does not provide a pre-collected, static, publicly available dataset, but rather the tools to generate the tasks used in the experiments.
Dataset Splits | No | The paper mentions a 'test set' for evaluation but does not explicitly state the use of a validation set or provide details on its split.
Hardware Specification | No | Experiments presented in this paper were carried out using 1) the PlaFRIM experimental testbed, supported by Inria, CNRS (LaBRI and IMB), Université de Bordeaux, Bordeaux INP and Conseil Régional d'Aquitaine (see https://www.plafrim.fr/), 2) the computing facilities MCIA (Mésocentre de Calcul Intensif Aquitain) of the Université de Bordeaux and of the Université de Pau et des Pays de l'Adour, and 3) the HPC resources of IDRIS under the allocation 2020-[A0091011996] made by GENCI. While computational facilities are named, no specific hardware components (e.g., GPU models, CPU types, memory) are detailed.
Software Dependencies | No | We use OpenAI SpinningUp's implementation for SAC and OpenAI Baselines' implementation for PPO. While the tools are named, specific version numbers for these implementations or other key software libraries are not provided.
Experiment Setup | Yes | For both our environments, we train our DRL students for 20 million steps. For each new episode, the teacher samples a new parameter vector used for the procedural generation of the environment. The teacher then receives the cumulative episodic reward, which can optionally be turned into a binary reward signal using expert knowledge (as in GoalGAN and Setter-Solver). Additionally, SPDL receives the initial state of the episode as well as the reward obtained at each step, as it is designed for a non-episodic RL setup. Every 500,000 steps, we test our student on a test set composed of 100 pre-defined tasks and monitor the percentage of test tasks on which the agent obtained an episodic reward greater than 230 (i.e. mastered tasks)... We perform a hyperparameter search for all ACL conditions through grid search (see appendix A), while controlling that an equivalent number of configurations are tested for each algorithm. See appendix C for additional experimental details.
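
The experiment-setup row above describes a complete teacher-student protocol, restated below as a minimal sketch. This is not the TeachMyAgent implementation: RandomTeacher, training_loop, make_env, student.run_episode and the parameter bounds are hypothetical placeholders. Only the protocol itself follows the paper's description: a new parameter vector is sampled for every episode, the cumulative episodic reward is fed back to the teacher (optionally binarized using expert knowledge, as for GoalGAN and Setter-Solver), and every 500,000 steps the student is evaluated on 100 fixed test tasks, with any task yielding a return above 230 counted as mastered.

```python
import numpy as np

TOTAL_STEPS = 20_000_000       # training budget per DRL student
EVAL_EVERY = 500_000           # evaluate every 500,000 environment steps
MASTERY_THRESHOLD = 230.0      # episodic reward above which a test task counts as mastered

# Illustrative bounds on the procedural-generation parameter vector
# (NOT the actual Stump Tracks / Parkour parameter ranges).
PARAM_BOUNDS = np.array([[0.0, 3.0],
                         [0.0, 6.0]])


class RandomTeacher:
    """Expert-knowledge-free baseline: samples task parameters uniformly at random."""

    def sample_task(self):
        return np.random.uniform(PARAM_BOUNDS[:, 0], PARAM_BOUNDS[:, 1])

    def update(self, task_params, episodic_reward, binarize=False):
        # A real ACL teacher would update its sampling distribution here; some
        # algorithms (e.g. GoalGAN, Setter-Solver) consume a binarized success
        # signal derived from expert knowledge instead of the raw return.
        success = episodic_reward > MASTERY_THRESHOLD if binarize else None
        _ = (task_params, success)  # no-op for the random baseline


def training_loop(teacher, student, make_env, test_set):
    """Assumed interfaces: make_env(params) -> env, and
    student.run_episode(env) -> (episodic_reward, n_steps)."""
    steps_done, next_eval = 0, EVAL_EVERY
    while steps_done < TOTAL_STEPS:
        params = teacher.sample_task()            # new parameter vector for each episode
        env = make_env(params)                    # procedural generation of the task
        episodic_reward, n_steps = student.run_episode(env)
        teacher.update(params, episodic_reward)   # curriculum feedback
        steps_done += n_steps

        if steps_done >= next_eval:               # periodic test on 100 pre-defined tasks
            returns = [student.run_episode(make_env(p))[0] for p in test_set]
            mastered = float(np.mean([r > MASTERY_THRESHOLD for r in returns]))
            print(f"{steps_done} steps: {100 * mastered:.1f}% of test tasks mastered")
            next_eval += EVAL_EVERY
```

In the benchmark, a teacher like the random one above corresponds to the knowledge-free baseline; the studied ACL conditions (e.g. GoalGAN, SPDL, Setter-Solver) would replace its sampling and update logic, and SPDL additionally receives the initial episode state and per-step rewards, as noted in the setup row.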