Diversify & Conquer: Outcome-directed Curriculum RL via Out-of-Distribution Disagreement

Authors: Daesol Cho, Seungjae Lee, H. Jin Kim

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present experimental results demonstrating that D2C outperforms prior curriculum RL methods in both quantitative and qualitative aspects, even with the arbitrarily distributed desired outcome examples. ... Section 5, Experiment: We conduct experiments on 6 environments that have multi-modal desired outcome distribution to validate our proposed method.
Researcher Affiliation | Academia | Daesol Cho, Seungjae Lee, and H. Jin Kim; Seoul National University; Automation and Systems Research Institute (ASRI); Artificial Intelligence Institute of Seoul National University (AIIS); dscho1234@snu.ac.kr, ysz0301@snu.ac.kr, hjinkim@snu.ac.kr
Pseudocode | Yes | Algorithm 1: Overview of D2C algorithm (in Section B, Algorithm, and Section E, Algorithm); a hedged sketch of such a curriculum loop is given after the table.
Open Source Code | No | The paper mentions links to implementations of baseline methods (e.g., 'OUTPACE [5]: We follow the default setting in the original implementation from https://github.com/jayLEE0301/outpace_official.'), but it does not provide a link or an explicit statement about the public availability of the source code for the proposed method (D2C).
Open Datasets | Yes | We conduct experiments on 6 environments that have multi-modal desired outcome distribution... We referred to the metaworld [48] and EARL [40] environments. ... The initial state of the agent is [0, 0] and the desired outcome states are obtained from the default goal points [8, 16], [-8, 16], [16, 8], [-16, 8] (collected into a configuration sketch after the table).
Dataset Splits | No | The paper describes training procedures and uses terms like 'evaluation success rates' and 'ablation study' to assess performance and sensitivity to hyperparameters. However, in the context of reinforcement learning experiments, it does not specify explicit static training/validation/test dataset splits with percentages or sample counts in the way supervised learning tasks typically do. Data is generated dynamically through environment interaction, and evaluation is performed on the environments themselves rather than on pre-defined validation sets (this protocol is illustrated after the table).
Hardware Specification | Yes | We used NVIDIA A5000 GPU and AMD Ryzen Threadripper 3960X for training, and each experiment took about 1-2 days for training.
Software Dependencies | No | The paper states 'D2C and all the baselines are trained by SAC [13]' and mentions the 'adam' optimizer. However, it does not provide specific version numbers for software dependencies such as Python, PyTorch, TensorFlow, or other relevant libraries required for reproducibility (a version-capture snippet is given after the table).
Experiment Setup | Yes | Table 2: Hyperparameters for D2C (e.g., critic hidden dim 512, discount factor γ = 0.99, batch size 512, learning rate for f_i 1e-3, learning rate for critic & actor 1e-4, optimizer Adam). Table 3: Default env-specific hyperparameters for D2C (e.g., number of heads, λ, ϵ, f_i update frequency in steps, number of f_i iterations per update, max episode horizon). The shared Table 2 values are gathered into a config sketch after the table.
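
The Algorithm 1 pseudocode itself is not reproduced on this page. As rough orientation only, the following is a minimal, hypothetical sketch of the kind of out-of-distribution-disagreement goal-proposal step that the paper's title and the quoted hyperparameters (an ensemble of heads f_i) suggest; the function, variable names, and scoring details are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def propose_curriculum_goals(ensemble, candidate_states, k=8):
    """Score candidate goal states by disagreement across the ensemble heads
    f_i and return the k states where the heads disagree most, treating that
    set as the out-of-distribution frontier. `ensemble` is a list of callables
    mapping a batch of states to scalar scores."""
    scores = np.stack([f(candidate_states) for f in ensemble])  # (heads, N)
    disagreement = scores.std(axis=0)                           # (N,)
    frontier_idx = np.argsort(disagreement)[-k:]
    return candidate_states[frontier_idx]

# Illustrative usage with random "heads" standing in for trained classifiers.
rng = np.random.default_rng(0)
heads = [lambda s, w=rng.normal(size=2): 1 / (1 + np.exp(-s @ w)) for _ in range(5)]
candidates = rng.uniform(-16, 16, size=(256, 2))                # 2-D goal space
goals = propose_curriculum_goals(heads, candidates)
print(goals.shape)  # (8, 2)
```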
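
The Open Datasets row quotes the initial state and the four default goal points of the multi-modal desired outcome distribution. Below is a minimal sketch of how those values could be written down as a goal configuration; the config keys, the noise model, and the sampling helper are assumptions, and only the coordinates come from the quote.

```python
import numpy as np

# Hypothetical config holding only the values quoted above: the agent starts
# at the origin and desired outcomes cluster around four default goal points.
POINT_ENV_CONFIG = {
    "initial_state": np.array([0.0, 0.0]),
    "desired_goal_points": np.array([[8.0, 16.0], [-8.0, 16.0],
                                     [16.0, 8.0], [-16.0, 8.0]]),
    "goal_noise_std": 0.5,  # assumption: small spread around each mode
}

def sample_desired_outcomes(cfg, n_per_mode=25, seed=0):
    """Sample desired outcome examples around each goal mode."""
    rng = np.random.default_rng(seed)
    modes = cfg["desired_goal_points"]
    noise = rng.normal(scale=cfg["goal_noise_std"],
                       size=(len(modes), n_per_mode, 2))
    return (modes[:, None, :] + noise).reshape(-1, 2)  # (4 * n_per_mode, 2)
```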
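
The Dataset Splits row points out that data is generated online and that performance is reported as evaluation success rates on the environments rather than on held-out splits. A schematic illustration of that protocol, assuming a classic gym-style step API (the API is an assumption, not stated in the paper):

```python
def evaluation_success_rate(env, policy, n_episodes=20, max_horizon=100):
    """Run evaluation episodes on the environment itself and report the
    fraction that end in success; there is no static validation/test split."""
    successes = 0
    for _ in range(n_episodes):
        obs = env.reset()
        for _ in range(max_horizon):
            obs, reward, done, info = env.step(policy(obs))
            if info.get("is_success", False):
                successes += 1
                break
            if done:
                break
    return successes / n_episodes
```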
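
The Software Dependencies row flags the missing version information. One way to close such a gap is to record the stack at run time, as in the snippet below; the choice of torch/gym/numpy is an assumption, since the paper only names SAC and the Adam optimizer.

```python
import platform
import sys

def log_environment_versions():
    """Record Python and library versions for later reproduction.
    torch/gym/numpy are assumptions about the stack; the paper does not state
    which libraries or versions were used."""
    versions = {"python": platform.python_version()}
    for name in ("torch", "gym", "numpy"):
        try:
            module = __import__(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = "not installed"
    print(versions, file=sys.stderr)

log_environment_versions()
```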
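
The Experiment Setup row quotes the shared hyperparameters of Table 2. They are collected into one plain config below for reference; the key names are illustrative, only the values come from the quoted table, and the env-specific Table 3 entries are omitted because their numbers are not quoted on this page.

```python
# Shared D2C hyperparameters as quoted from Table 2 of the paper.
# Key names are illustrative; only the values come from the quoted table.
D2C_HYPERPARAMS = {
    "critic_hidden_dim": 512,
    "discount_factor": 0.99,
    "batch_size": 512,
    "lr_f_i": 1e-3,           # learning rate for the ensemble heads f_i
    "lr_critic_actor": 1e-4,  # learning rate for critic & actor
    "optimizer": "adam",
}
```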