Beyond OOD State Actions: Supported Cross-Domain Offline Reinforcement Learning

Authors: Jinxin Liu, Ziqi Zhang, Zhenyu Wei, Zifeng Zhuang, Yachen Kang, Sibo Gai, Donglin Wang

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments with a variety of source domains that have transition dynamics mismatch and demonstrate that BOSA contributes to significant gains when learning from cross-domain offline data. Further, we show that BOSA can be plugged into more general cross-domain offline settings: model-based RL and (noising) data augmentation.
Researcher Affiliation | Academia | Jinxin Liu*, Ziqi Zhang*, Zhenyu Wei, Zifeng Zhuang, Yachen Kang, Sibo Gai, Donglin Wang; School of Engineering, Westlake University
Pseudocode | No | The paper describes its methods using mathematical formulations and descriptive text, but it does not include a dedicated 'Pseudocode' or 'Algorithm' block.
Open Source Code | No | Due to page limitations, we leave the technical details and supplementary appendix to https://arxiv.org/pdf/2306.12755.pdf.
Open Datasets | Yes | We use the D4RL [Fu et al. 2020] offline data as the target domain and use a cross-domain dynamics modification similar to that used in DARA [Liu, Zhang, and Wang 2022] to collect source-domain data. (A hedged setup sketch follows the table.)
Dataset Splits | No | The paper states 'we only use 10% of the D4RL data in the target domain,' which refers to the amount of data used for training, but it does not provide explicit train/validation/test splits (e.g., percentages or sample counts) or cite predefined splits that would aid reproducibility. (One plausible subsampling sketch appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment.
Experiment Setup | Yes | Then, in Figure 3 (c), we study the hyper-parameter sensitivity to the thresholds ϵ_th and ϵ′_th in supported policy and value optimization, respectively. We can see that BOSA is largely robust when varying ϵ_th and ϵ′_th, consistently outperforming SPOT with 10% of the D4RL data. Empirically, we find that performance improves with ensemble size, but the improvement saturates around 5; thus, we use an ensemble of 5 models. We average our results over 5 seeds and, for each seed, compute the normalized average score over 10 episodes. (The evaluation protocol is sketched after the table.)
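The cross-domain setup noted in the Open Datasets row pairs unmodified D4RL data (target domain) with data from a dynamics-perturbed variant of the same task (source domain). Below is a minimal sketch of that setup, assuming the `gym` and `d4rl` packages; the body-mass scaling is only an illustrative perturbation, not necessarily the exact DARA-style modification used in the paper.

```python
# Hedged sketch of the cross-domain data setup. The target domain is a
# standard D4RL offline dataset; the source domain is the same task with
# perturbed transition dynamics. The 1.5x body-mass scaling is illustrative
# only -- the paper follows the dynamics modification used in DARA.
import gym
import d4rl  # noqa: F401  (importing d4rl registers its environments)

# Target domain: unmodified D4RL offline data.
target_env = gym.make("halfcheetah-medium-v2")
target_data = target_env.get_dataset()  # dict of observations, actions, rewards, ...

# Source domain: same embodiment, mismatched dynamics (hypothetical choice).
source_env = gym.make("HalfCheetah-v2")
source_env.unwrapped.model.body_mass[:] *= 1.5  # illustrative dynamics mismatch
```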
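For the Dataset Splits row, 'we only use 10% of the D4RL data' is the only split-like detail given. One plausible reading, sketched below, is uniform transition-level subsampling; the paper does not specify the exact procedure, so `subsample_fraction` is a hypothetical helper.

```python
import numpy as np

def subsample_fraction(dataset, fraction=0.1, seed=0):
    """Keep a uniformly random `fraction` of transitions from a
    D4RL-style dataset (a dict of equal-length numpy arrays).

    Transition-level subsampling is one plausible reading of
    'we only use 10% of the D4RL data'; the paper does not state
    the exact procedure.
    """
    rng = np.random.default_rng(seed)
    n = len(dataset["observations"])
    idx = rng.choice(n, size=int(fraction * n), replace=False)
    return {k: v[idx] for k, v in dataset.items() if len(v) == n}

# e.g., target_data_10pct = subsample_fraction(target_data, fraction=0.1)
```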
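The evaluation protocol in the Experiment Setup row (5 seeds, 10 episodes per seed, normalized scores) is concrete enough to sketch. In the sketch below, `train_bosa` is a hypothetical stand-in for the paper's training procedure; `get_normalized_score` is D4RL's standard score normalization.

```python
import numpy as np

def evaluate(env, policy, n_episodes=10):
    """Average normalized D4RL score of `policy` over `n_episodes` rollouts."""
    scores = []
    for _ in range(n_episodes):
        obs, done, ret = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            ret += reward
        scores.append(100.0 * env.get_normalized_score(ret))
    return float(np.mean(scores))

# Paper's protocol: average over 5 seeds, 10 evaluation episodes per seed.
seed_scores = []
for seed in range(5):
    policy = train_bosa(seed=seed)  # hypothetical training entry point
    seed_scores.append(evaluate(target_env, policy, n_episodes=10))
print(f"normalized score: {np.mean(seed_scores):.1f} +/- {np.std(seed_scores):.1f}")
```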