Behavior Estimation from Multi-Source Data for Offline Reinforcement Learning

Authors: Guoxi Zhang, Hisashi Kashima

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Lastly, with extensive empirical evaluation this work confirms the risks of neglecting data heterogeneity and the efficacy of the proposed model.
Researcher Affiliation | Academia | ¹Graduate School of Informatics, Kyoto University; ²RIKEN Guardian Robot Project; guoxi@ml.ist.i.kyoto-u.ac.jp, kashima@i.kyoto-u.ac.jp
Pseudocode | No | The paper describes algorithms and models in text and diagrams, but it does not contain structured pseudocode or algorithm blocks with explicit labels.
Open Source Code | Yes | Other details are available in Appendix A and the code is available here². ²https://github.com/Altriaex/multi_source_behavior_modeling
Open Datasets | Yes | This study validates its claims empirically using the D4RL benchmark (Fu et al. 2020) and 15 new datasets. Experiment results show that algorithms that estimate a single behavior policy worsened on multi-source data, which confirms the detriment of neglecting data heterogeneity. ¹These datasets are available at https://zenodo.org/record/7375417#.Y4Wzti9KGgQ
Dataset Splits | No | The paper describes the generation of new datasets ('heterogeneous-k') and mentions '20 test runs' for evaluation, but it does not explicitly provide specific details about training, validation, or test dataset splits (e.g., percentages, sample counts, or explicit references to standard splits for their own experiments).
Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run its experiments (e.g., specific GPU/CPU models, memory, or cloud instance types).
Software Dependencies | No | The paper mentions using code from other works (e.g., 'the code provided by Fu et al. (2020) for BRAC-v and BCQ and the official code for CQL and PLAS'), but it does not specify version numbers for any key software components or libraries used in their own implementation.
Experiment Setup | Yes | d_e was set to eight. f_s and f_p were parameterized by two layers of feed-forward networks with 200 hidden units, while f_{s,a} and f_Q were parameterized similarly but with 300 hidden units. The learning rates for the policy network and the Q-network were 5×10⁻⁵ and 1×10⁻⁴, respectively. Other details are available in Appendix A and the code is available here². (This configuration is sketched below.)
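To make the Experiment Setup row concrete, the following is a minimal PyTorch sketch of the reported configuration. It is not the authors' implementation (see their repository above): the input and output dimensions (state_dim, action_dim), the ReLU activations, the choice of Adam as the optimizer, and the reading of "two layers of feed-forward networks" as two hidden layers are assumptions; only the hidden widths (200 and 300), the embedding size d_e = 8, and the learning rates (5×10⁻⁵ and 1×10⁻⁴) come from the quoted excerpt.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 17, 6   # assumed placeholder dimensions; not stated in the excerpt
d_e = 8                         # "d_e was set to eight"

def mlp(in_dim, hidden, out_dim):
    # Two feed-forward hidden layers of `hidden` units (our reading of the excerpt),
    # followed by a linear output layer. The ReLU activation is an assumption.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

# f_s and f_p: 200 hidden units each; f_{s,a} and f_Q: 300 hidden units each.
f_s  = mlp(state_dim, 200, d_e)               # state encoder (output size d_e assumed)
f_p  = mlp(d_e, 200, action_dim)              # policy head (input/output sizes assumed)
f_sa = mlp(state_dim + action_dim, 300, d_e)  # state-action encoder (sizes assumed)
f_Q  = mlp(d_e, 300, 1)                       # Q-value head

# Learning rates from the excerpt: 5e-5 (policy network) and 1e-4 (Q-network).
# Adam is an assumed optimizer choice, not stated in the excerpt.
policy_opt = torch.optim.Adam(list(f_s.parameters()) + list(f_p.parameters()), lr=5e-5)
q_opt      = torch.optim.Adam(list(f_sa.parameters()) + list(f_Q.parameters()), lr=1e-4)
```

The sketch only fixes the sizes and learning rates reported above; how the encoders and heads are actually composed and trained is documented in the paper's Appendix A and in the linked repository.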