Curriculum Reinforcement Learning using Optimal Transport via Gradual Domain Adaptation
Authors: Peide Huang, Mengdi Xu, Jiacheng Zhu, Laixi Shi, Fei Fang, Ding Zhao
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments in locomotion and manipulation tasks and show that our proposed GRADIENT achieves higher performance than baselines in terms of learning efficiency and asymptotic performance. (Section 5, Experiments) |
| Researcher Affiliation | Academia | Peide Huang, Mengdi Xu, Jiacheng Zhu, Laixi Shi, Fei Fang, Ding Zhao Carnegie Mellon University Pittsburgh, PA 15213 {peideh, mengdixu, jzhu4, laixis, feifang, dingzhao}@andrew.cmu.edu |
| Pseudocode | Yes | Algorithm 1: GRAdual Domain adaptation for curriculum reInforcEment learNing via optimal Transport (GRADIENT) and Algorithm 2: Compute Barycenter (a hedged sketch of the barycenter step follows the table) |
| Open Source Code | Yes | Code is available under https://github.com/PeideHuang/gradient.git (paper footnote 1) |
| Open Datasets | Yes | In FetchPush [58], the objective is to use the gripper to push the box to a goal position. The observation space is a 28-dimension vector, including information about the goal. The context is a 2-dimension vector representing the goal position on a surface. (Reference [58] is 'OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.') |
| Dataset Splits | No | The paper does not explicitly provide training/test/validation dataset splits in the traditional supervised learning sense. Reinforcement learning experiments involve agents interacting with environments, where 'data' is generated through this interaction rather than pre-defined splits. The paper defines 'source task distribution' and 'target task distribution' for generating curricula and evaluates performance on the target task. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., GPU/CPU models, memory, or cloud instance types) used to run the experiments. |
| Software Dependencies | No | For the learner, we use the SAC [52] and PPO [53] implementations provided in the Stable Baselines3 library [54]. For the optimal transport computation, we use POT [55]. (No explicit version numbers for Stable Baselines3 or POT are provided in the text.) |
| Experiment Setup | Yes | Input: Source task distribution µ(c), target task distribution ν(c), interpolation factor α, distance metric d, reward threshold G, maximum number of stages K. We then generate curricula using GRADIENT with α = 0.2, 0.1, 0.05. (Hedged sketches of the barycenter interpolation and the stage-advancement loop follow the table.) |
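
The paper's Algorithm 2 computes Wasserstein barycenters to interpolate between the source and target task distributions, and the paper states it uses POT for the optimal transport computation. Below is a minimal, hedged sketch of that interpolation idea with POT; the 1-D context grid, the Gaussian-shaped histograms, and the `interpolate` helper are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
import ot  # POT: Python Optimal Transport, the library cited in the paper

# Hypothetical shared 1-D support for the context distributions (assumption:
# the paper's contexts can be multi-dimensional; 1-D keeps the sketch small).
support = np.linspace(0.0, 1.0, 50).reshape(-1, 1)

# Illustrative source and target histograms over that support.
mu = np.exp(-((support[:, 0] - 0.1) ** 2) / 0.005)
mu /= mu.sum()
nu = np.exp(-((support[:, 0] - 0.9) ** 2) / 0.005)
nu /= nu.sum()

# Squared-Euclidean ground cost between support points.
M = ot.dist(support, support)
M /= M.max()

def interpolate(mu, nu, t, reg=1e-2):
    """Entropic Wasserstein barycenter of (mu, nu) with weights (1 - t, t).

    t = 0 stays (approximately) at the source; t = 1 reaches the target.
    `reg` is the Sinkhorn regularization strength (a tuning assumption).
    """
    A = np.vstack([mu, nu]).T  # POT expects one histogram per column
    return ot.bregman.barycenter(A, M, reg, weights=np.array([1.0 - t, t]))

# With alpha = 0.2 (one of the paper's settings), the curriculum reaches
# the target distribution in 5 equal interpolation steps.
alpha = 0.2
curriculum = [interpolate(mu, nu, k * alpha) for k in range(6)]
```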
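
Algorithm 1's outer loop, as summarized in the Experiment Setup row, advances to the next curriculum stage once the agent clears the reward threshold G. The sketch below shows only that control flow; `train` and `evaluate` are hypothetical stand-ins for the paper's SAC/PPO training and evaluation (the paper uses the Stable Baselines3 implementations), and the per-stage budget cap is an assumption, not the paper's stopping rule.

```python
from typing import Callable, Sequence
import numpy as np

def gradient_outer_loop(
    stages: Sequence[np.ndarray],             # barycenter distributions, e.g. from the sketch above
    train: Callable[[np.ndarray], None],      # hypothetical: train the agent on tasks sampled from the stage
    evaluate: Callable[[np.ndarray], float],  # hypothetical: mean return of the agent on the stage
    G: float,                                 # reward threshold from Algorithm 1's inputs
    max_rounds_per_stage: int = 20,           # budget cap (assumption, not from the paper)
) -> None:
    """Advance through curriculum stages once performance exceeds G."""
    for k, stage in enumerate(stages):
        for _ in range(max_rounds_per_stage):
            train(stage)
            if evaluate(stage) >= G:
                break  # threshold met: move on to the next, harder stage
```

Capping the rounds per stage keeps the loop from stalling if a stage's threshold is never reached; how the paper handles that case is not stated in the quoted text.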