Diversify & Conquer: Outcome-directed Curriculum RL via Out-of-Distribution Disagreement
Authors: Daesol Cho, Seungjae Lee, H. Jin Kim
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present experimental results demonstrating that D2C outperforms prior curriculum RL methods in both quantitative and qualitative aspects, even with the arbitrarily distributed desired outcome examples. ... 5 Experiment: We conduct experiments on 6 environments that have multi-modal desired outcome distribution to validate our proposed method. |
| Researcher Affiliation | Academia | Daesol Cho, Seungjae Lee, and H. Jin Kim; Seoul National University; Automation and Systems Research Institute (ASRI); Artificial Intelligence Institute of Seoul National University (AIIS); dscho1234@snu.ac.kr, ysz0301@snu.ac.kr, hjinkim@snu.ac.kr |
| Pseudocode | Yes | Algorithm 1 Overview of D2C algorithm (in Section B Algorithm and E Algorithm) |
| Open Source Code | No | The paper mentions links to implementations of baseline methods (e.g., 'OUTPACE [5]: We follow the default setting in the original implementation from https://github.com/jayLEE0301/outpace_official.'), but it does not provide a link or an explicit statement about the public availability of the source code for the proposed method (D2C). |
| Open Datasets | Yes | We conduct experiments on 6 environments that have multi-modal desired outcome distribution... We referred to the metaworld [48] and EARL [40] environments. ... The initial state of the agent is [0, 0] and the desired outcome states are obtained from the default goal points [8, 16], [ 8, 16], [16, 8], [ 16, 8]. |
| Dataset Splits | No | The paper describes training procedures and uses terms like 'evaluation success rates' and 'ablation study' to assess performance and sensitivity to hyperparameters. However, as is typical for reinforcement learning experiments, it does not specify static training/validation/test dataset splits with percentages or sample counts in the way supervised learning work does. Data is generated dynamically through environment interaction, and evaluation is performed in the environments rather than on pre-defined validation sets. |
| Hardware Specification | Yes | We used NVIDIA A5000 GPU and AMD Ryzen Threadripper 3960X for training, and each experiment took about 1-2 days for training. |
| Software Dependencies | No | The paper states 'D2C and all the baselines are trained by SAC [13]' and mentions 'optimizer adam'. However, it does not provide specific version numbers for software dependencies such as Python, PyTorch, TensorFlow, or other relevant libraries required for reproducibility. |
| Experiment Setup | Yes | Table 2: Hyperparameters for D2C (e.g., critic hidden dim: 512, discount factor γ: 0.99, batch size: 512, learning rate for f_i: 1e-3, learning rate for critic & actor: 1e-4, optimizer: Adam). Table 3: Default env-specific hyperparameters for D2C (e.g., # of heads, λ, ϵ, f_i update frequency (steps), # of f_i iterations per update, max episode horizon). A hedged configuration sketch collecting these values is given below the table. |
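
To make the reported setup concrete, below is a minimal sketch of how the hyperparameters quoted from Table 2 could be grouped into a single config object. Only the shared values quoted above (critic hidden dim, γ, batch size, learning rates, optimizer) come from the paper; the env-specific fields mirror the names listed for Table 3, but their values here are hypothetical placeholders, since the paper reports those per environment. The class name `D2CConfig` and all field names are assumptions for illustration, not identifiers from the authors' code.

```python
from dataclasses import dataclass


@dataclass
class D2CConfig:
    # Shared hyperparameters quoted from Table 2 of the paper.
    critic_hidden_dim: int = 512
    discount_factor: float = 0.99      # gamma
    batch_size: int = 512
    lr_f_ensemble: float = 1e-3        # learning rate for the f_i ensemble
    lr_critic_actor: float = 1e-4      # learning rate for critic & actor
    optimizer: str = "adam"

    # Env-specific hyperparameter names from Table 3; the values below are
    # hypothetical placeholders, NOT the paper's per-environment numbers.
    num_heads: int = 8                 # "# of heads" (placeholder)
    lam: float = 1.0                   # λ (placeholder)
    eps: float = 0.1                   # ϵ (placeholder)
    f_update_freq_steps: int = 1000    # f_i update freq in env steps (placeholder)
    f_iters_per_update: int = 10       # # of f_i iterations per update (placeholder)
    max_episode_horizon: int = 100     # (placeholder)


if __name__ == "__main__":
    # Example usage: instantiate the defaults and inspect them.
    cfg = D2CConfig()
    print(cfg)
```

Collecting the values this way simply makes explicit which settings the paper fixes globally and which ones a re-implementation would still need to recover per environment from Table 3.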