Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Curriculum Reinforcement Learning using Optimal Transport via Gradual Domain Adaptation

Authors: Peide Huang, Mengdi Xu, Jiacheng Zhu, Laixi Shi, Fei Fang, Ding Zhao

NeurIPS 2022

Reproducibility variables, classified results, and supporting LLM responses:

Research Type: Experimental
  "We conduct extensive experiments in locomotion and manipulation tasks and show that our proposed GRADIENT achieves higher performance than baselines in terms of learning efficiency and asymptotic performance." (Section 5, Experiments)

Researcher Affiliation: Academia
  "Peide Huang, Mengdi Xu, Jiacheng Zhu, Laixi Shi, Fei Fang, Ding Zhao. Carnegie Mellon University, Pittsburgh, PA 15213."

Pseudocode: Yes
  Algorithm 1: GRAdual Domain adaptation for curriculum reInforcEment learNing via optimal Transport (GRADIENT); Algorithm 2: Compute Barycenter

Open Source Code: Yes
  "Code is available under https://github.com/PeideHuang/gradient.git"

Open Datasets: Yes
  "In FetchPush [58], the objective is to use the gripper to push the box to a goal position. The observation space is a 28-dimensional vector, including information about the goal. The context is a 2-dimensional vector representing the goal position on a surface." (Reference [58] is "OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.")

Dataset Splits: No
  The paper does not provide training/validation/test splits in the traditional supervised-learning sense. In reinforcement learning, data are generated through agent-environment interaction rather than drawn from pre-defined splits; the paper instead defines a source task distribution and a target task distribution for generating curricula and evaluates performance on the target task.

Hardware Specification: No
  The paper does not describe the hardware (e.g., GPU/CPU models, memory, or cloud instance types) used to run the experiments.

Software Dependencies: No
  "For the learner, we use the SAC [52] and PPO [53] implementations provided in the Stable-Baselines3 library [54]. For the optimal transport computation, we use POT [55]." (No version numbers are given for Stable-Baselines3 or POT.)

Experiment Setup: Yes
  "Input: Source task distribution µ(c), target task distribution ν(c), interpolation factor α, distance metric d, reward threshold G, maximum number of stages K." Curricula are generated with GRADIENT using α = 0.2, 0.1, 0.05.
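The setup above interpolates between a source and a target task distribution with a factor α, using Wasserstein barycenters (the paper's Algorithm 2, computed with POT in general). In one dimension the interpolant has a closed form: the optimal transport map between equal-size samples is the monotone rearrangement, so the barycenter at weight α is an elementwise convex combination of the sorted samples. A minimal NumPy sketch of that special case (the function name and the Gaussian task distributions are illustrative, not taken from the paper):

```python
import numpy as np

def wasserstein_interpolation(source, target, alpha):
    """Displacement interpolation between two equal-size 1-D samples.

    In 1-D the optimal transport plan matches the i-th smallest source
    point to the i-th smallest target point, so the barycenter with
    weights (1 - alpha, alpha) is an elementwise convex combination
    of the sorted samples.
    """
    s = np.sort(np.asarray(source, dtype=float))
    t = np.sort(np.asarray(target, dtype=float))
    assert s.shape == t.shape, "samples must have equal size"
    return (1.0 - alpha) * s + alpha * t

# Illustrative 1-D task distributions: "easy" source vs "hard" target.
rng = np.random.default_rng(0)
source = rng.normal(0.0, 1.0, size=1000)
target = rng.normal(5.0, 1.0, size=1000)

# A curriculum of intermediate task distributions in steps of alpha = 0.2,
# mirroring the staged interpolation described in the setup.
stages = [wasserstein_interpolation(source, target, a)
          for a in np.linspace(0.0, 1.0, 6)]
```

At α = 0 the stage coincides with the source tasks and at α = 1 with the target tasks; general (multi-dimensional) context spaces require an entropic or free-support barycenter solver such as those in POT.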