What is Essential for Unseen Goal Generalization of Offline Goal-conditioned RL?

Authors: Rui Yang, Lin Yong, Xiaoteng Ma, Hao Hu, Chongjie Zhang, Tong Zhang

ICML 2023

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | In a number of experiments, we observe that weighted imitation learning enjoys better generalization than pessimism-based offline RL methods. On a new benchmark containing 9 independent and identically distributed (IID) tasks and 17 OOD tasks, GOAT outperforms current state-of-the-art methods by a large margin.
Researcher Affiliation | Academia | 1 The Hong Kong University of Science and Technology; 2 Tsinghua University.
Pseudocode | Yes | Algorithm 1: GOAT Algorithm
Open Source Code | Yes | Code is available at https://github.com/YangRui2015/GOAT
Open Datasets | Yes | The introduced benchmark is modified from MuJoCo robotic manipulation environments (Plappert et al., 2018).
Dataset Splits | No | The paper defines 'training datasets' and 'evaluation tasks' (IID and OOD), and specifies the number of offline datasets and evaluation tasks. However, it does not provide specific percentages or counts for training, validation, or test splits, nor does it refer to standard predefined splits for the experimental data.
Hardware Specification | Yes | For our experiments, we use one single GPU (NVIDIA GeForce RTX 2080 Ti 11 GB) and one CPU core (Intel Xeon W-2245 CPU @ 3.90GHz).
Software Dependencies | No | The paper mentions software components like 'Adam optimizer' and 'tensorflow' but does not specify their version numbers, which is required for a reproducible description of software dependencies.
Experiment Setup | Yes | We use a batch size of 512, a discount factor of γ = 0.98, and an Adam optimizer with learning rate 5 × 10⁻⁴ for all algorithms. We also normalize the observations and goals with estimated mean and standard deviation. The relabel probability p_relabel = 1 for most environments except for Slide Left-Right and Slide Near-Far, where p_relabel = 0.2 and 0.5, respectively. In EAW, the ratio β is set to 2 and EAW is clipped into the range (0, M] for numerical stability, where M is set to 10 in our experiments. For DSW, we utilize a First-In-First-Out (FIFO) queue B_a of size 5 × 10⁴ to store recently calculated advantage values, and the percentile threshold α gradually increases from 0 to α_max. We use α_max = 80 for all tasks except Hand Reach and Slide Left-Right; α_max = 50 for Hand Reach, and α_max = 0 for Slide Left-Right. When A(s, a, g′) < c, where c is the α-quantile value of B_a, we set ϵ(A(s, a, g′)) = 0.05 instead of 0, following (Yang et al., 2022b). For the uncertainty weight (UW), we use N = 5 ensemble Q networks to calculate the standard deviation Std(s, g) and maintain another FIFO queue B_std to store recent Std(s, g) values. The Std(s, g) values are then normalized to [0, 1] with the maximum and minimum values in B_std. Besides, w_min is set to 0.5 and w is searched from {1, 1.5, 2, 2.5}. Regarding the expectile regression (ER), we search τ ∈ {0.1, 0.3} because empirical results in Appendix D.5 show that τ ∈ {0.1, 0.3} performs the best.
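The quoted setup describes three weighting components applied to the imitation loss: an exponential advantage weight (EAW) clipped into (0, M], a threshold weight (DSW) based on a percentile of recently stored advantages, and an uncertainty weight (UW) from normalized ensemble standard deviations. A minimal NumPy sketch of how these could be computed is below; this is not the released GOAT implementation — the function names, the linear mapping from normalized uncertainty to weight, and the queue handling are illustrative assumptions based only on the description above.

```python
import numpy as np
from collections import deque

# Hyperparameters quoted from the setup above.
BETA = 2.0        # EAW ratio beta
M = 10.0          # EAW clip upper bound
EPS_LOW = 0.05    # epsilon assigned below the DSW threshold
W_MIN = 0.5       # lower bound w_min for the uncertainty weight

adv_queue = deque(maxlen=50_000)  # B_a: recent advantage values
std_queue = deque(maxlen=50_000)  # B_std: recent ensemble std values

def eaw(adv):
    """Exponential advantage weight exp(beta * A), clipped into (0, M]."""
    return np.clip(np.exp(BETA * np.asarray(adv)), 1e-8, M)

def dsw(adv, alpha):
    """Assign EPS_LOW when the advantage falls below the alpha-percentile
    of recently stored advantages, else 1."""
    adv_queue.extend(np.atleast_1d(adv))
    c = np.percentile(adv_queue, alpha)
    return np.where(np.asarray(adv) < c, EPS_LOW, 1.0)

def uw(std, w=1.5):
    """Uncertainty weight from ensemble std normalized to [0, 1] with the
    min/max of the queue; the linear map to [W_MIN, w] is an assumption."""
    std_queue.extend(np.atleast_1d(std))
    lo, hi = min(std_queue), max(std_queue)
    norm = (np.asarray(std) - lo) / max(hi - lo, 1e-8)
    return np.maximum(w - norm * (w - W_MIN), W_MIN)

def total_weight(adv, std, alpha):
    """Combined per-sample weight for the weighted imitation loss."""
    return eaw(adv) * dsw(adv, alpha) * uw(std)
```

In this reading, α would be annealed from 0 to α_max over training and w swept over {1, 1.5, 2, 2.5}, matching the search described in the setup.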