Transferable Curricula through Difficulty Conditioned Generators
Authors: Sidney Tio, Pradeep Varakantham
IJCAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, we seek to answer the following research questions (RQ): RQ1: How well does PERM represent the environment parameter space with ability and difficulty measures? RQ2: How do RL agents trained by PERM compare to other UED baselines? We compare two variants of PERM, PERM-Online and PERM-Offline, with the following baselines: PLR (Robust Prioritized Level Replay [Jiang et al., 2021a]), PAIRED [Dennis et al., 2020], and Domain Randomization (DR, [Tobin et al., 2017]). For all experiments, we train a student PPO agent [Schulman et al., 2017] in OpenAI Gym's Lunar Lander and Bipedal Walker [Brockman et al., 2016]. We first evaluate PERM's effectiveness in representing the parameter space on both OpenAI Gym environments. Specifically, we evaluate how the latent variables ability a and difficulty d correlate with the rewards obtained in each interaction, as well as PERM's capability in generating environment parameters. We then provide a proof of concept of PERM's curriculum generation on the Lunar Lander environment, which has only two environment parameters to tune. Lastly, we scale to the more complex Bipedal Walker environment, which has eight environment parameters, and compare the performance of the trained agent against other methods using the same evaluation environment parameters as in Parker-Holder et al. [2022]. The results are visualized in Figure 2 and Figure 3, and summary statistics are provided in Table 1. As we see in both plots, the latent representations a (blue) and d (orange) largely correlate with our expectations of their respective relationships with the response variable r. When both ability and difficulty are regressed against the response variable, we achieve an R-squared of 1.00 and 0.986 for Lunar Lander and Bipedal Walker respectively, indicating that both latent representations are near-perfect predictors of the reward achieved by an agent in a given parameterized environment. Turning to PERM's capability in generating environment parameters (Figure 2 & 3, green), we see that PERM achieves near-perfect recovery of all environment parameters on the test set, as indicated by the MSE between input parameters and recovered parameters. Taking the strong results of PERM in recovering environment parameters from the latent variables, we proceed to generate curricula to train RL agents. (See the regression sketch after the table.) |
| Researcher Affiliation | Academia | Sidney Tio, Pradeep Varakantham; Singapore Management University; sidney.tio.2021@phdcs.smu.edu.sg, pradeepv@smu.edu.sg |
| Pseudocode | Yes | Algorithm 1: Curriculum Generation for RL Agents with PERM (see the hypothetical loop sketch after the table). |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for their method is publicly available. |
| Open Datasets | Yes | For all experiments, we train a student PPO agent [Schulman et al., 2017] in OpenAI Gym's Lunar Lander and Bipedal Walker [Brockman et al., 2016] (see the environment and training sketch after the table). |
| Dataset Splits | No | The paper describes separate training and test environments but does not specify numerical splits (e.g., percentages or counts) for partitioning a single dataset into training, validation, and test subsets. It mentions periodically evaluating the agent on "test environments" that are distinct from the training environments, but this is not a validation split in the traditional dataset-partitioning sense. |
| Hardware Specification | No | The paper does not specify any particular hardware components (e.g., GPU models, CPU types, memory) used for conducting the experiments. |
| Software Dependencies | No | The paper mentions general software like "Open AI Gym" and the use of a "PPO agent" but does not provide specific version numbers for these or any other ancillary libraries or frameworks. |
| Experiment Setup | No | The paper mentions training durations such as "1e6 environment timesteps" and "3 billion environment steps". However, it does not provide specific hyperparameter values (e.g., learning rate, batch size, optimizer details) or detailed training configurations. |
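
The R-squared analysis quoted under Research Type (regressing reward on the latent ability a and difficulty d) can be reproduced with an ordinary least-squares fit. The sketch below is illustrative only, assuming scikit-learn and using synthetic placeholder arrays; in the actual experiments the latents come from PERM's encoder and the rewards from agent-environment interactions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
ability = rng.normal(size=(1000, 1))     # placeholder for PERM's latent ability a
difficulty = rng.normal(size=(1000, 1))  # placeholder for PERM's latent difficulty d
# Toy response: reward rises with ability and falls with difficulty.
reward = 2.0 * ability - 1.5 * difficulty + rng.normal(scale=0.1, size=(1000, 1))

# Regress reward on both latents and report the coefficient of determination.
X = np.hstack([ability, difficulty])
model = LinearRegression().fit(X, reward.ravel())
print(f"R-squared of reward on (a, d): {model.score(X, reward.ravel()):.3f}")
```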
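
The Pseudocode row refers to Algorithm 1 (Curriculum Generation for RL Agents with PERM). The sketch below is a hypothetical outline of what such a difficulty-conditioned loop could look like, not the paper's algorithm: the names `perm.estimate_ability`, `perm.generate_env_params`, and `make_env` are placeholders introduced here for illustration.

```python
def train_with_curriculum(agent, perm, make_env, n_iterations, steps_per_iter):
    """Hypothetical loop: alternate between estimating the student's ability
    and generating environments whose difficulty matches it."""
    for _ in range(n_iterations):
        # 1. Estimate the student's current ability from recent performance.
        ability = perm.estimate_ability(agent)

        # 2. Generate environment parameters whose difficulty matches that ability.
        env_params = perm.generate_env_params(target_difficulty=ability)

        # 3. Train the student (e.g., a PPO agent) on the generated environment.
        env = make_env(env_params)
        agent.learn(env, total_timesteps=steps_per_iter)
    return agent
```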
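
The Open Datasets and Experiment Setup rows cite OpenAI Gym's Lunar Lander and Bipedal Walker and a budget of "1e6 environment timesteps". A minimal sketch follows, assuming Gymnasium and Stable-Baselines3; the paper specifies neither library versions nor PPO hyperparameters, so the defaults below are illustrative.

```python
import gymnasium as gym
from stable_baselines3 import PPO

for env_id in ("LunarLander-v2", "BipedalWalker-v3"):  # IDs may differ by Gymnasium version
    env = gym.make(env_id)
    model = PPO("MlpPolicy", env, verbose=0)            # library-default hyperparameters
    model.learn(total_timesteps=1_000_000)              # "1e6 environment timesteps"
    model.save(f"ppo_{env_id}")
```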