Beyond OOD State Actions: Supported Cross-Domain Offline Reinforcement Learning
Authors: Jinxin Liu, Ziqi Zhang, Zhenyu Wei, Zifeng Zhuang, Yachen Kang, Sibo Gai, Donglin Wang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments with a variety of source domains that have transition dynamics mismatch and demonstrate that BOSA contributes to significant gains on learning from cross-domain offline data. Further, we show that BOSA can be plugged into more general cross-domain offline settings: model-based RL and (noising) data augmentation. |
| Researcher Affiliation | Academia | Jinxin Liu*, Ziqi Zhang*, Zhenyu Wei, Zifeng Zhuang, Yachen Kang, Sibo Gai, Donglin Wang — School of Engineering, Westlake University |
| Pseudocode | No | The paper describes its methods using mathematical formulations and descriptive text, but it does not include a dedicated 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | No | The paper does not link to a code repository; it only points readers to the arXiv version for details: "Due to page limitations, we leave the technical details and supplementary appendix to https://arxiv.org/pdf/2306.12755.pdf." |
| Open Datasets | Yes | We use the D4RL [Fu et al. 2020] offline data as the target domain and use the similar cross-domain dynamics modification utilized in DARA [Liu, Zhang, and Wang 2022] to collect source-domain data (see the first sketch after the table). |
| Dataset Splits | No | The paper states "we only use 10% of the D4RL data in the target domain," which describes the amount of training data, but it provides no explicit train/validation/test splits (percentages or sample counts) and cites no predefined splits for reproducibility (see the subsampling sketch after the table). |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | Then, in Figure 3 (c), we study the hyper-parameter sensitivity on the thresholds ϵ_th and ϵ̂_th in supported policy and value optimization, respectively. We can see that BOSA is largely robust when varying ϵ_th and ϵ̂_th, consistently outperforming SPOT with 10% D4RL data. Empirically, we find that performance improves with ensemble size, but the improvement saturates around 5; we therefore learn an ensemble of 5 models. We average our results over 5 seeds and, for each seed, compute the normalized average score over 10 episodes (see the evaluation sketch after the table). |
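
The data setup above (D4RL target domain, DARA-style source domain with modified dynamics) can be illustrated with a minimal sketch. The specific perturbation here — scaling all body masses by 1.5 — is an assumption for illustration; the paper defers its exact dynamics modifications to the arXiv appendix.

```python
# Minimal sketch: load the D4RL target-domain dataset and build a
# dynamics-modified source domain in the style of DARA.
import gym
import d4rl  # importing d4rl registers the offline environments with gym

# Target domain: standard D4RL offline data.
target_env = gym.make("halfcheetah-medium-v2")
target_data = target_env.get_dataset()  # dict of observations, actions, rewards, ...

# Source domain: same task with perturbed transition dynamics.
# The 1.5x body-mass scaling is an illustrative choice, not the paper's setting.
source_env = gym.make("HalfCheetah-v2")
source_env.unwrapped.model.body_mass[:] *= 1.5
# Source-domain data would then be collected by rolling out a behavior
# policy in `source_env`; that collection loop is omitted here.
```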
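For the "10% of the D4RL data" protocol, a plausible reading is a random subsample of transitions; the paper does not state how the subset is drawn, so the uniform sampling below is an assumption.

```python
# Hypothetical sketch of keeping 10% of a D4RL-style dataset.
import numpy as np

def subsample_dataset(dataset, fraction=0.1, seed=0):
    """Keep a random `fraction` of transitions from a D4RL-style dict of arrays."""
    rng = np.random.default_rng(seed)
    n = len(dataset["observations"])
    keep = rng.choice(n, size=int(n * fraction), replace=False)
    # Filter to per-transition arrays only, in case metadata keys are present.
    return {k: v[keep] for k, v in dataset.items() if len(v) == n}
```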
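The evaluation protocol (5 seeds, 10 episodes per seed, D4RL-normalized scores) can be sketched as follows; `train_bosa` is a hypothetical stand-in for training a BOSA agent, not an API from the paper.

```python
# Sketch of the reported evaluation protocol, assuming a d4rl environment
# (which exposes per-task score normalization via get_normalized_score).
import numpy as np

def evaluate(env, policy, n_episodes=10):
    """Average D4RL-normalized score of `policy` over `n_episodes` rollouts."""
    scores = []
    for _ in range(n_episodes):
        obs, done, ret = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            ret += reward
        scores.append(env.get_normalized_score(ret) * 100)
    return float(np.mean(scores))

# Final number: mean over 5 training seeds (train_bosa is hypothetical).
# final = np.mean([evaluate(env, train_bosa(seed)) for seed in range(5)])
```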