TeachMyAgent: a Benchmark for Automatic Curriculum Learning in Deep RL
Authors: Clément Romac, Rémy Portelas, Katja Hofmann, Pierre-Yves Oudeyer
ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we identify several key challenges faced by ACL algorithms. Based on these, we present TeachMyAgent (TA), a benchmark of current ACL algorithms leveraging procedural task generation. It includes 1) challenge-specific unit-tests using variants of a procedural Box2D bipedal walker environment, and 2) a new procedural Parkour environment combining most ACL challenges, making it ideal for global performance assessment. We then use TeachMyAgent to conduct a comparative study of representative existing approaches, showcasing the competitiveness of some ACL algorithms that do not use expert knowledge. |
| Researcher Affiliation | Collaboration | ¹Inria, France; ²Microsoft Research, UK. Correspondence to: Clément Romac <clement.romac@inria.fr>, Rémy Portelas <remy.portelas@inria.fr>. |
| Pseudocode | No | The paper describes the ACL algorithms conceptually but does not include any pseudocode or algorithm blocks in the provided text. |
| Open Source Code | Yes | We open-source our environments, all studied ACL algorithms (collected from open-source code or re-implemented), and DRL students in a Python package available at https://github.com/flowersteam/TeachMyAgent. |
| Open Datasets | No | The paper describes procedurally generated environments (Stump Tracks and Parkour) which are part of the benchmark that is open-sourced. It does not provide a pre-collected, static, publicly available dataset in the traditional sense, but rather the tools to generate the data for experiments. |
| Dataset Splits | No | The paper mentions a 'test set' for evaluation but does not explicitly state the use of a 'validation' set or provide details on its split. |
| Hardware Specification | No | Experiments presented in this paper were carried out using 1) the PlaFRIM experimental testbed, supported by Inria, CNRS (LABRI and IMB), Université de Bordeaux, Bordeaux INP and Conseil Régional d'Aquitaine (see https://www.plafrim.fr/), 2) the computing facilities MCIA (Mésocentre de Calcul Intensif Aquitain) of the Université de Bordeaux and of the Université de Pau et des Pays de l'Adour, and 3) the HPC resources of IDRIS under the allocation 2020-[A0091011996] made by GENCI. While computational facilities are named, no specific hardware components (e.g., GPU models, CPU types, memory) are detailed. |
| Software Dependencies | No | We use OpenAI Spinning Up's implementation for SAC and OpenAI Baselines' implementation for PPO. While the tools are named, specific version numbers for these implementations or other key software libraries are not provided. |
| Experiment Setup | Yes | For both our environments, we train our DRL students for 20 million steps. For each new episode, the teacher samples a new parameter vector used for the procedural generation of the environment. The teacher then receives the cumulative episodic reward, which can optionally be turned into a binary reward signal using expert knowledge (as in GoalGAN and Setter-Solver). Additionally, SPDL receives the initial state of the episode as well as the reward obtained at each step, as it is designed for non-episodic RL setups. Every 500,000 steps, we test our student on a test set composed of 100 pre-defined tasks and monitor the percentage of test tasks on which the agent obtained an episodic reward greater than 230 (i.e. mastered tasks)... We perform a hyperparameter search for all ACL conditions through grid search (see appendix A), while controlling that an equivalent number of configurations are tested for each algorithm. See appendix C for additional experimental details. A minimal sketch of this teacher-student loop is given after the table. |
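The experiment-setup row above describes a teacher-student curriculum loop: each episode the teacher picks a parameter vector for the procedural generator, observes the episodic return, and the student is periodically evaluated on a fixed test set. The following is a minimal, self-contained Python sketch of that loop under stated assumptions; it is not the TeachMyAgent API. `RandomTeacher`, `run_episode`, the fixed episode length, and the toy two-dimensional task space are hypothetical stand-ins; only the constants (20 million training steps, evaluation every 500,000 steps, 100 test tasks, mastery threshold of 230) come from the paper's description.

```python
import random

TOTAL_STEPS = 20_000_000       # DRL students are trained for 20M steps
EVAL_EVERY = 500_000           # evaluate every 500,000 environment steps
MASTERY_THRESHOLD = 230        # a test task counts as "mastered" above this return
EPISODE_LEN = 2_000            # assumed fixed episode length (sketch only)


class RandomTeacher:
    """Toy ACL teacher sampling task parameters uniformly.

    A stand-in for the algorithms studied in the paper (ALP-GMM, GoalGAN,
    SPDL, Setter-Solver, ...), which adapt their sampling distribution.
    """

    def sample_task(self):
        # e.g. stump height / spacing in a Stump Tracks-like parameter space
        return [random.uniform(0.0, 3.0), random.uniform(0.0, 6.0)]

    def update(self, task_params, episodic_return):
        # A real teacher updates its curriculum here. Some teachers
        # (e.g. GoalGAN, Setter-Solver) first binarize the return with an
        # expert-chosen threshold before updating.
        pass


def run_episode(task_params):
    # Dummy student/environment pair: returns a fake episodic return.
    return random.gauss(150.0, 100.0)


def evaluate(test_set):
    # Percentage of pre-defined test tasks with return above the threshold.
    returns = [run_episode(p) for p in test_set]
    mastered = sum(r > MASTERY_THRESHOLD for r in returns)
    return 100.0 * mastered / len(test_set)


teacher = RandomTeacher()
test_set = [[random.uniform(0.0, 3.0), random.uniform(0.0, 6.0)] for _ in range(100)]

steps, next_eval = 0, EVAL_EVERY
while steps < TOTAL_STEPS:
    params = teacher.sample_task()            # new procedural task each episode
    episodic_return = run_episode(params)     # student acts in the generated env
    teacher.update(params, episodic_return)   # teacher adapts its curriculum
    steps += EPISODE_LEN
    if steps >= next_eval:
        print(f"{steps} steps: {evaluate(test_set):.1f}% of test tasks mastered")
        next_eval += EVAL_EVERY
```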