Curriculum Reinforcement Learning via Constrained Optimal Transport

Authors: Pascal Klink, Haoyi Yang, Carlo D’Eramo, Jan Peters, Joni Pajarinen

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility variables, results, and supporting LLM responses:
Research Type: Experimental. "6. Experiments: The experiments in this section serve to validate the identified benefits of the proposed interpolation-based CRL method, which we will refer to as CURROT. We proceed by showing that the method can generate curricula for different target distributions µ(c) while avoiding problems arising from parametric restrictions on the context distribution that e.g. SPRL imposes. We also show that even in scenarios with target distributions uniformly covering C, the proposed method significantly improves over previous evaluations of interpolation-based CRL methods, matching and surpassing the performance of the best-performing methods so far."
Researcher Affiliation: Academia. "¹Intelligent Autonomous Systems, Technical University of Darmstadt, Germany; ²Department of Electrical Engineering and Automation, Aalto University, Finland."
Pseudocode: Yes. "Algorithm 1: Curricula via Optimal Transport (CURROT)"
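Algorithm 1 is only named in the excerpt above, so the following is a minimal sketch of the kind of particle-based curriculum update it describes: the current context distribution is represented by samples that are coupled to samples from the target distribution µ(c) via an optimal assignment and moved toward them, gated by the agent's estimated performance. The step rule, the performance gate, and all variable names are illustrative assumptions rather than the authors' exact procedure.

```python
# Hedged sketch of a single particle-based curriculum update in the spirit of
# Algorithm 1 (CURROT). The step rule and the performance gate are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def curriculum_update(particles, target_samples, returns, delta, eps):
    """Move context particles toward samples of the target distribution mu(c),
    but only for contexts whose estimated return exceeds the threshold delta,
    and by at most eps per update (a simple Wasserstein-style trust region)."""
    # Squared Euclidean cost between current particles and target samples.
    cost = ((particles[:, None, :] - target_samples[None, :, :]) ** 2).sum(-1)
    rows, cols = linear_sum_assignment(cost)          # optimal one-to-one coupling
    direction = target_samples[cols] - particles[rows]
    dist = np.linalg.norm(direction, axis=1, keepdims=True) + 1e-12
    step = np.minimum(dist, eps) * direction / dist   # clip each move to length eps
    gate = (returns[rows] >= delta)[:, None]          # only move "solved" contexts
    return particles[rows] + gate * step

# Toy usage: 2-D context space, uniform target, random stand-in return estimates.
rng = np.random.default_rng(0)
particles = rng.uniform(0.0, 0.2, size=(200, 2))      # easy initial contexts
targets = rng.uniform(0.0, 1.0, size=(200, 2))        # samples from mu(c)
returns = rng.uniform(0.0, 1.0, size=200)
particles = curriculum_update(particles, targets, returns, delta=0.5, eps=0.05)
```

In the paper, this interpolation is instead formulated as a constrained optimal transport problem (minimizing the Wasserstein distance to µ(c) subject to a performance constraint); the gated step above is only a crude stand-in for that optimization.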
Open Source Code: Yes. "Code is provided under: https://github.com/psclklnk/currot."
Open Datasets: No. The paper describes custom or modified environments (Sparse Goal Reaching, Point Mass, Bipedal Walker Stump Tracks) in which agents are trained, rather than using or providing access to a static, publicly available dataset. No link, DOI, or repository for a dataset is provided.
Dataset Splits: No. The paper does not provide training, validation, and test dataset splits. In reinforcement learning, data is generated through interaction with the environment, so fixed splits as in supervised learning do not directly apply.
Hardware Specification: Yes. "In our experiments, we used 200 samples for 2-D spaces and 500 samples for 3-D spaces, leading to solving times of less than 100ms with the linear sum assignment function of the SciPy library (Virtanen et al., 2020) on an AMD Ryzen 9 3900X."
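The quoted passage ties the reported solve time to SciPy's linear sum assignment on particle sets of the stated sizes. The snippet below only illustrates that call with a rough timing on random placeholder cost matrices; actual numbers depend on hardware (the paper reports under 100ms on an AMD Ryzen 9 3900X).

```python
# Rough, hardware-dependent timing of the linear sum assignment for the
# sample counts stated in the paper (200 particles in 2-D, 500 in 3-D).
import time
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
for n, d in [(200, 2), (500, 3)]:
    a = rng.random((n, d))                                   # current particles
    b = rng.random((n, d))                                   # target samples
    cost = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)    # squared Euclidean costs
    start = time.perf_counter()
    linear_sum_assignment(cost)
    print(f"n={n}, d={d}: {(time.perf_counter() - start) * 1e3:.1f} ms")
```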
Software Dependencies: Yes. "In our implementation, we use the Gurobi optimization software (Gurobi Optimization, LLC, 2021) to solve the above problem." "For example, the GeomLoss library (Feydy & Roussillon, 2019), that we use in our implementations..." "As RL agents, we use SAC and PPO implemented in the Stable Baselines 3 library (Raffin et al., 2021)..."
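The dependencies are only cited by name, so here is a hedged sketch of how the GeomLoss library is typically invoked to compare two particle sets with a Sinkhorn divergence; the loss parameters (p=2, blur=0.05), the sample sizes, and the interpretation of the two point clouds are assumptions. The Stable Baselines 3 agents appear in the configuration sketch after the Experiment Setup row below.

```python
# Minimal sketch of a GeomLoss Sinkhorn divergence between two particle sets,
# e.g. current context particles vs. samples from the target distribution;
# the parameter values and sample sizes are assumptions, not the paper's settings.
import torch
from geomloss import SamplesLoss   # Feydy & Roussillon, 2019

sinkhorn = SamplesLoss(loss="sinkhorn", p=2, blur=0.05)
current = torch.rand(200, 2)       # current context particles
target = torch.rand(200, 2)        # samples from mu(c)
print(sinkhorn(current, target))   # scalar tensor (entropy-regularized OT cost)
```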
Experiment Setup: Yes. "This section discusses hyperparameters and additional details of the conducted experiments that could not be provided in the main text due to space limitations." "E.1. Algorithm Hyperparameters: The two main parameters of the SPRL algorithm are the performance threshold δ as well as the allowed distance between subsequent distributions ϵ. ... Table 1 shows the parameters of CURROT and SPRL for the different environments. For ALP-GMM, the relevant hyperparameters are the percentage of random samples drawn from the context space p_rand, the number of completed learning episodes between updates of the context distribution n_rollout, and the maximum buffer size of past trajectories to keep s_buffer. ... We use the SAC algorithm for learning in this task. Compared to the default algorithm parameters of Stable Baselines 3, we only changed the policy update frequency to 5 environment steps, increased the batch size to 512, and reduced the buffer size to 200,000 steps."
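The quoted appendix lists three deviations from the Stable Baselines 3 SAC defaults. Below is a minimal sketch of how they map onto the SAC constructor; interpreting the "policy update frequency" as `train_freq` is an assumption, and "Pendulum-v1" is a placeholder for the paper's custom environments.

```python
# Hedged sketch of the stated deviations from the Stable Baselines 3 SAC defaults.
from stable_baselines3 import SAC

agent = SAC(
    "MlpPolicy",
    "Pendulum-v1",         # placeholder; the paper uses custom environments
    train_freq=5,          # assumed mapping of "policy update frequency of 5 environment steps"
    batch_size=512,        # increased from the default of 256
    buffer_size=200_000,   # reduced replay buffer size
)
agent.learn(total_timesteps=10_000)
```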