What is Essential for Unseen Goal Generalization of Offline Goal-conditioned RL?
Authors: Rui Yang, Yong Lin, Xiaoteng Ma, Hao Hu, Chongjie Zhang, Tong Zhang
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In a number of experiments, we observe that weighted imitation learning enjoys better generalization than pessimism-based offline RL methods. On a new benchmark containing 9 independent and identically distributed (IID) tasks and 17 OOD tasks, GOAT outperforms current state-of-the-art methods by a large margin. |
| Researcher Affiliation | Academia | 1The Hong Kong University of Science and Technology 2Tsinghua University. |
| Pseudocode | Yes | Algorithm 1 GOAT Algorithm |
| Open Source Code | Yes | Code is available at https://github.com/YangRui2015/GOAT |
| Open Datasets | Yes | The introduced benchmark is modified from MuJoCo robotic manipulation environments (Plappert et al., 2018). |
| Dataset Splits | No | The paper defines 'training datasets' and 'evaluation tasks' (IID and OOD), and specifies the number of offline datasets and evaluation tasks. However, it does not provide specific percentages or counts for training, validation, or test splits, nor does it refer to standard predefined splits for the experimental data. |
| Hardware Specification | Yes | For our experiments, we use one single GPU (NVIDIA GeForce RTX 2080 Ti 11 GB) and one CPU core (Intel Xeon W-2245 CPU @ 3.90 GHz). |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer' and 'tensorflow' but does not specify their version numbers, which is required for a reproducible description of software dependencies. |
| Experiment Setup | Yes | We use a batch size of 512, a discount factor of γ = 0.98, and an Adam optimizer with learning rate 5×10⁻⁴ for all algorithms. We also normalize the observations and goals with estimated mean and standard deviation. The relabel probability p_relabel = 1 for most environments except for Slide Left-Right and Slide Near-Far, where p_relabel = 0.2 and 0.5, respectively. In EAW, the ratio β is set to 2 and EAW is clipped into the range (0, M] for numerical stability, where M is set to 10 in our experiments. For DSW, we utilize a First-In-First-Out (FIFO) queue B_a of size 5×10⁴ to store recently calculated advantage values, and the percentile threshold α gradually increases from 0 to α_max. We use α_max = 80 for all tasks except Hand Reach (α_max = 50) and Slide Left-Right (α_max = 0). When A(s, a, g′) < c, where c is the α-quantile value of B_a, we set ϵ(A(s, a, g′)) = 0.05 instead of 0, following (Yang et al., 2022b). For the uncertainty weight (UW), we use N = 5 ensemble Q-networks to calculate the standard deviation Std(s, g) and maintain another FIFO queue B_std to store recent Std(s, g) values. The Std(s, g) values are then normalized to [0, 1] with the maximum and minimum values in B_std. Besides, w_min is set to 0.5 and w is searched from {1, 1.5, 2, 2.5}. Regarding the expectile regression (ER), we search τ ∈ {0.1, 0.3} because empirical results in Appendix D.5 show that τ ∈ {0.1, 0.3} performs best. Illustrative sketches of these components follow the table. |
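The relabeling step in the setup above amounts to replacing a transition's goal with a future achieved goal from the same trajectory with probability p_relabel. A minimal sketch, assuming batched NumPy arrays; `relabel_goals` and its arguments are hypothetical names, not the authors' code:

```python
import numpy as np

def relabel_goals(goals, future_achieved_goals, p_relabel=1.0, rng=None):
    """Hindsight relabeling with probability p_relabel (1.0 for most tasks;
    0.2 for Slide Left-Right and 0.5 for Slide Near-Far in the reported setup)."""
    rng = rng or np.random.default_rng()
    mask = rng.random(len(goals)) < p_relabel        # relabel with prob. p_relabel
    return np.where(mask[:, None], future_achieved_goals, goals)
```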
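The exponential advantage weight (EAW) is the clipped term exp(β·A) with β = 2 and M = 10 as reported. A minimal sketch, assuming a scalar or array-valued advantage; the function name is illustrative:

```python
import numpy as np

def exponential_advantage_weight(advantage, beta=2.0, weight_max=10.0):
    # exp(beta * A) clipped into (0, M] for numerical stability;
    # exp(.) > 0 always, so only the upper bound M needs enforcing
    return np.minimum(np.exp(beta * advantage), weight_max)
```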
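The DSW bookkeeping (FIFO queue B_a, annealed percentile threshold α, floor weight ϵ = 0.05) could look roughly as follows. The per-update anneal step is an assumption; the setup only states that α increases from 0 to α_max:

```python
from collections import deque
import numpy as np

class DataSelectionWeight:
    """Sketch of DSW: store recent advantages in a FIFO queue B_a of size
    5e4 and down-weight samples below the alpha-th percentile."""

    def __init__(self, queue_size=50_000, alpha_max=80.0, eps_min=0.05):
        self.advantages = deque(maxlen=queue_size)  # FIFO queue B_a
        self.alpha = 0.0                            # annealed from 0 to alpha_max
        self.alpha_max = alpha_max
        self.eps_min = eps_min                      # epsilon = 0.05 below threshold

    def update(self, batch_advantages, anneal_step=0.05):
        # anneal_step is an assumed schedule, not a reported value
        self.advantages.extend(np.ravel(batch_advantages))
        self.alpha = min(self.alpha + anneal_step, self.alpha_max)

    def weight(self, advantage):
        if not self.advantages:                     # no statistics yet: keep everything
            return np.ones_like(np.asarray(advantage, dtype=float))
        c = np.percentile(np.asarray(self.advantages), self.alpha)  # alpha-quantile of B_a
        return np.where(np.asarray(advantage) >= c, 1.0, self.eps_min)
```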
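For the uncertainty weight (UW), the normalization with B_std is stated explicitly, but the exact mapping from normalized Std(s, g) to a weight in [w_min, w] is not quoted here; the linear interpolation below is an assumption:

```python
import numpy as np

def uncertainty_weight(q_ensemble, std_queue, w=1.5, w_min=0.5):
    """q_ensemble: array of shape (N, batch) from the N = 5 ensemble Q-networks;
    std_queue: recent Std(s, g) values (the FIFO queue B_std)."""
    std = np.asarray(q_ensemble).std(axis=0)        # Std(s, g) per sample
    lo, hi = np.min(std_queue), np.max(std_queue)   # normalize with B_std extremes
    norm = np.clip((std - lo) / (hi - lo + 1e-8), 0.0, 1.0)
    return w - (w - w_min) * norm                   # assumed: high uncertainty -> w_min
```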
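Expectile regression (ER) uses the standard asymmetric squared loss L_τ(u) = |τ − 1[u < 0]|·u², with τ searched over {0.1, 0.3}. A sketch with diff = target − prediction (this sign convention is an assumption):

```python
import numpy as np

def expectile_loss(diff, tau=0.3):
    # |tau - 1[u < 0]| * u^2: weight tau on positive errors, (1 - tau) on negative
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return weight * diff ** 2
```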