Behavior Estimation from Multi-Source Data for Offline Reinforcement Learning
Authors: Guoxi Zhang, Hisashi Kashima
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Lastly, with extensive empirical evaluation, this work confirms the risks of neglecting data heterogeneity and the efficacy of the proposed model. |
| Researcher Affiliation | Academia | 1 Graduate School of Informatics, Kyoto University 2 RIKEN Guardian Robot Project guoxi@ml.ist.i.kyoto-u.ac.jp, kashima@i.kyoto-u.ac.jp |
| Pseudocode | No | The paper describes algorithms and models in text and diagrams, but it does not contain structured pseudocode or algorithm blocks with explicit labels. |
| Open Source Code | Yes | Other details are available in Appendix A and the code is available at https://github.com/Altriaex/multi_source_behavior_modeling |
| Open Datasets | Yes | This study validates its claims empirically using the D4RL benchmark (Fu et al. 2020) and 15 new datasets. Experiment results show that algorithms that estimate a single behavior policy worsened on multi-source data, which confirms the detriment of neglecting data heterogeneity. These datasets are available at https://zenodo.org/record/7375417#.Y4Wzti9KGgQ |
| Dataset Splits | No | The paper describes the generation of new datasets ('heterogeneous-k') and mentions '20 test runs' for evaluation, but it does not explicitly provide specific details about training, validation, or test dataset splits (e.g., percentages, sample counts, or explicit references to standard splits for their own experiments). |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments (e.g., specific GPU/CPU models, memory, or cloud instance types). |
| Software Dependencies | No | The paper mentions using code from other works (e.g., 'the code provided by Fu et al. (2020) for BRAC-v and BCQ and the official code for CQL and PLAS'), but it does not specify version numbers for any key software components or libraries used in their own implementation. |
| Experiment Setup | Yes | d_e was set to eight. f_s and f_p were parameterized by two layers of feed-forward networks with 200 hidden units, while f_{s,a} and f_Q were parameterized similarly but with 300 hidden units. The learning rates for the policy network and the Q-network were 5×10^-5 and 1×10^-4. Other details are available in Appendix A and the code is available at https://github.com/Altriaex/multi_source_behavior_modeling |
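The architecture details quoted in the Experiment Setup row can be sketched as follows. This is a minimal numpy illustration, not the authors' implementation: the state and action dimensions (`STATE_DIM`, `ACTION_DIM`) are placeholder assumptions, and only the sizes stated in the paper (two hidden layers of 200 or 300 units, embedding size d_e = 8, and the two learning rates) come from the source.

```python
import numpy as np

# Assumed input sizes for illustration only (not given in the paper).
STATE_DIM, ACTION_DIM, D_E = 17, 6, 8

def mlp_dims(in_dim, hidden, out_dim, n_hidden=2):
    """Layer sizes for a feed-forward net with n_hidden hidden layers."""
    return [in_dim] + [hidden] * n_hidden + [out_dim]

def init_mlp(dims, rng):
    """He-initialized (weight, bias) pairs, one per layer."""
    return [(rng.standard_normal((i, o)) * np.sqrt(2.0 / i), np.zeros(o))
            for i, o in zip(dims[:-1], dims[1:])]

def forward(params, x):
    """ReLU MLP forward pass with a linear output layer."""
    for k, (W, b) in enumerate(params):
        x = x @ W + b
        if k < len(params) - 1:
            x = np.maximum(x, 0.0)
    return x

rng = np.random.default_rng(0)
# f_s: state network with two hidden layers of 200 units, d_e-dim output.
f_s = init_mlp(mlp_dims(STATE_DIM, 200, D_E), rng)
# f_Q: Q-network over (state, action), two hidden layers of 300 units.
f_Q = init_mlp(mlp_dims(STATE_DIM + ACTION_DIM, 300, 1), rng)

emb = forward(f_s, np.ones(STATE_DIM))      # shape (8,)
q = forward(f_Q, np.ones(STATE_DIM + ACTION_DIM))  # shape (1,)

# Learning rates reported in the paper.
POLICY_LR, Q_LR = 5e-5, 1e-4
```

The paper parameterizes f_p like f_s (200 hidden units) and f_{s,a} like f_Q (300 hidden units); they would be built the same way with their own input and output sizes.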