Curriculum Reinforcement Learning using Optimal Transport via Gradual Domain Adaptation

Authors: Peide Huang, Mengdi Xu, Jiacheng Zhu, Laixi Shi, Fei Fang, Ding Zhao

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments in locomotion and manipulation tasks and show that our proposed GRADIENT achieves higher performance than baselines in terms of learning efficiency and asymptotic performance. (Section 5: Experiments)
Researcher Affiliation | Academia | Peide Huang, Mengdi Xu, Jiacheng Zhu, Laixi Shi, Fei Fang, Ding Zhao, Carnegie Mellon University, Pittsburgh, PA 15213, {peideh, mengdixu, jzhu4, laixis, feifang, dingzhao}@andrew.cmu.edu
Pseudocode | Yes | Algorithm 1: GRAdual Domain adaptation for curriculum reInforcEment learNing via optimal Transport (GRADIENT) and Algorithm 2: Compute Barycenter (a minimal interpolation sketch appears after this table)
Open Source Code | Yes | Code is available under https://github.com/PeideHuang/gradient.git
Open Datasets | Yes | In FetchPush [58], the objective is to use the gripper to push the box to a goal position. The observation space is a 28-dimension vector, including information about the goal. The context is a 2-dimension vector representing the goal position on a surface. (Reference [58] is 'OpenAI Gym', arXiv preprint arXiv:1606.01540, 2016; an environment-inspection snippet appears after this table.)
Dataset Splits | No | The paper does not provide training/validation/test splits in the traditional supervised-learning sense. The reinforcement learning experiments involve agents interacting with environments, so 'data' is generated through that interaction rather than drawn from pre-defined splits. Instead, the paper defines a 'source task distribution' and a 'target task distribution' for generating curricula and evaluates performance on the target task distribution.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory, or cloud instance types) used to run the experiments.
Software Dependencies | No | For the learner, we use the SAC [52] and PPO [53] implementations provided in the Stable Baselines3 library [54]. For the optimal transport computation, we use POT [55]. (No explicit version numbers for Stable Baselines3 or POT are given in the text; a version-check snippet appears after this table.)
Experiment Setup | Yes | Input: source task distribution µ(c), target task distribution ν(c), interpolation factor α, distance metric d, reward threshold G, maximum number of stages K. We then generate curricula using GRADIENT with α = 0.2, 0.1, 0.05. (These inputs drive the stage-wise interpolation sketched after this table.)
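
The Pseudocode and Experiment Setup rows describe GRADIENT as a stage-wise interpolation between the source and target task distributions, controlled by the interpolation factor α, reward threshold G, and maximum number of stages K, with Algorithm 2 computing a barycenter. The sketch below shows one plausible way to realize that loop with POT's free-support Wasserstein barycenter solver; the context values, particle counts, and loop structure are illustrative assumptions, not the authors' implementation (which lives in the repository linked above).

```python
# Minimal sketch of a GRADIENT-style curriculum, assuming task-context
# distributions are represented by particle samples. Uses POT's
# ot.lp.free_support_barycenter; the paper's Algorithm 2 may differ in details.
import numpy as np
import ot  # POT: Python Optimal Transport


def interpolate_contexts(source_c, target_c, t, n_particles=128, seed=0):
    """Free-support Wasserstein barycenter of the source and target context
    samples with coordinates (1 - t, t): t = 0 recovers the source
    distribution, t = 1 the target distribution."""
    rng = np.random.default_rng(seed)
    measure_locations = [source_c, target_c]
    measure_weights = [np.full(len(source_c), 1.0 / len(source_c)),
                       np.full(len(target_c), 1.0 / len(target_c))]
    x_init = rng.normal(size=(n_particles, source_c.shape[1]))
    return ot.lp.free_support_barycenter(
        measure_locations, measure_weights, x_init,
        weights=np.array([1.0 - t, t]))


# Inputs named in Algorithm 1; the values here are illustrative
# (the paper reports runs with alpha = 0.2, 0.1, 0.05).
alpha, reward_threshold_G, max_stages_K = 0.1, 0.0, 12

# Hypothetical 2-D goal contexts standing in for mu(c) and nu(c).
rng = np.random.default_rng(0)
source_contexts = rng.uniform(-0.05, 0.05, size=(256, 2))
target_contexts = rng.uniform(-0.25, 0.25, size=(256, 2))

t = 0.0
for stage in range(max_stages_K):
    t = min(t + alpha, 1.0)
    stage_contexts = interpolate_contexts(source_contexts, target_contexts, t)
    # Here one would sample tasks from stage_contexts and train the learner
    # (e.g., Stable-Baselines3 SAC/PPO) until its average return exceeds
    # reward_threshold_G before advancing to the next stage.
    print(f"stage {stage}: t = {t:.2f}, mean context = {stage_contexts.mean(axis=0)}")
    if t >= 1.0:
        break
```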
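
For the FetchPush task cited in the Open Datasets row, the snippet below is one way to instantiate the environment and inspect the observation and goal information described there. It assumes a classic OpenAI Gym installation with the MuJoCo-based robotics environments and the id "FetchPush-v1"; the exact id and Gym version are not stated in the paper, so treat both as assumptions.

```python
# Hedged inspection of the FetchPush task; assumes classic OpenAI Gym with the
# robotics environments installed and "FetchPush-v1" registered.
import gym

env = gym.make("FetchPush-v1")
obs = env.reset()  # classic Gym API: reset() returns the observation dict

# The observation is a dict whose parts together carry the state-plus-goal
# information described in the row above.
print(env.observation_space)      # Dict(achieved_goal, desired_goal, observation)
print(obs["observation"].shape)   # robot/object features
print(obs["desired_goal"].shape)  # goal position (the paper's 2-D context parameterizes the goal on the surface)
env.close()
```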
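
Because the Software Dependencies row notes that no version numbers are given for Stable Baselines3 or POT, a reproduction attempt can at least record whatever versions are installed locally. The package/module names below (stable_baselines3 and ot) are the standard ones for these libraries and are not taken from the paper.

```python
# Record the locally installed versions of the two key dependencies; the paper
# does not pin versions, so these values reflect only the local environment.
import stable_baselines3  # SAC / PPO implementations used by the learner
import ot                 # POT: Python Optimal Transport

print("stable-baselines3:", stable_baselines3.__version__)
print("POT:", ot.__version__)
```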