Behavior Estimation from Multi-Source Data for Offline Reinforcement Learning

Authors: Guoxi Zhang, Hisashi Kashima

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Lastly, with extensive empirical evaluation this work confirms the risks of neglecting data heterogeneity and the efficacy of the proposed model.
Researcher Affiliation | Academia | ¹Graduate School of Informatics, Kyoto University; ²RIKEN Guardian Robot Project; guoxi@ml.ist.i.kyoto-u.ac.jp, kashima@i.kyoto-u.ac.jp
Pseudocode | No | The paper describes algorithms and models in text and diagrams, but it does not contain structured pseudocode or algorithm blocks with explicit labels.
Open Source Code | Yes | Other details are available in Appendix A and the code is available here². ²https://github.com/Altriaex/multi_source_behavior_modeling
Open Datasets | Yes | This study validates its claims empirically using the D4RL benchmark (Fu et al. 2020) and 15 new datasets. Experiment results show that algorithms that estimate a single behavior policy worsened on multi-source data, which confirms the detriment of neglecting data heterogeneity. ¹These datasets are available at https://zenodo.org/record/7375417#.Y4Wzti9KGgQ
Dataset Splits | No | The paper describes the generation of new datasets ('heterogeneous-k') and mentions '20 test runs' for evaluation, but it does not explicitly provide specific details about training, validation, or test dataset splits (e.g., percentages, sample counts, or explicit references to standard splits for their own experiments).
Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments (e.g., specific GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper mentions using code from other works (e.g., 'the code provided by Fu et al. (2020) for BRAC-v and BCQ and the official code for CQL and PLAS'), but it does not specify version numbers for any key software components or libraries used in their own implementation.
Experiment Setup | Yes | d_e was set to eight. f_s and f_p were parameterized by two layers of feed-forward networks with 200 hidden units, while f_{s,a} and f_Q were parameterized similarly but with 300 hidden units. The learning rates for the policy network and the Q-network were 5×10⁻⁵ and 1×10⁻⁴, respectively. Other details are available in Appendix A and the code is available here². (This configuration is sketched below.)
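To make the Experiment Setup row concrete, the following is a minimal PyTorch sketch of the reported configuration. It is not the authors' implementation (see their repository above): the input and output dimensions (state_dim, action_dim), the ReLU activations, the choice of Adam as the optimizer, and the reading of "two layers of feed-forward networks" as two hidden layers are assumptions; only the hidden widths (200 and 300), the embedding size d_e = 8, and the learning rates (5×10⁻⁵ and 1×10⁻⁴) come from the quoted excerpt.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 17, 6   # assumed placeholder dimensions; not stated in the excerpt
d_e = 8                         # "d_e was set to eight"

def mlp(in_dim, hidden, out_dim):
    # Two feed-forward hidden layers of `hidden` units (our reading of the excerpt),
    # followed by a linear output layer. The ReLU activation is an assumption.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

# f_s and f_p: 200 hidden units each; f_{s,a} and f_Q: 300 hidden units each.
f_s  = mlp(state_dim, 200, d_e)               # state encoder (output size d_e assumed)
f_p  = mlp(d_e, 200, action_dim)              # policy head (input/output sizes assumed)
f_sa = mlp(state_dim + action_dim, 300, d_e)  # state-action encoder (sizes assumed)
f_Q  = mlp(d_e, 300, 1)                       # Q-value head

# Learning rates from the excerpt: 5e-5 (policy network) and 1e-4 (Q-network).
# Adam is an assumed optimizer choice, not stated in the excerpt.
policy_opt = torch.optim.Adam(list(f_s.parameters()) + list(f_p.parameters()), lr=5e-5)
q_opt      = torch.optim.Adam(list(f_sa.parameters()) + list(f_Q.parameters()), lr=1e-4)
```

The sketch only fixes the sizes and learning rates reported above; how the encoders and heads are actually composed and trained is documented in the paper's Appendix A and in the linked repository.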