Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Offline Reinforcement Learning with Mixture of Deterministic Policies
Authors: Takayuki Osa, Akinobu Hayashi, Pranav Deo, Naoki Morihira, Takahide Yoshiike
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show empirically that the use of the proposed mixture policy can reduce the accumulation of the critic loss in offline RL, which was reported in previous studies. Experimental results also indicate that using a mixture of deterministic policies in offline RL improves the performance with the D4RL benchmarking datasets. [...] We empirically show that the use of a mixture of deterministic policies can reduce the accumulation of the approximation error in offline RL. [...] Through experiments with benchmark tasks in D4RL (Fu et al., 2020), we demonstrate that the proposed algorithms are competitive with prevalent offline RL methods. |
| Researcher Affiliation | Collaboration | Takayuki Osa (The University of Tokyo; RIKEN), Akinobu Hayashi (Honda R&D Co., Ltd.), Pranav Deo (Honda R&D Co., Ltd.), Naoki Morihira (Honda R&D Co., Ltd.), Takahide Yoshiike (Honda R&D Co., Ltd.) |
| Pseudocode | Yes | Algorithm 1 Deterministic mixture policy optimization (DMPO) |
| Open Source Code | Yes | Our implementation is available at https://github.com/TakaOsa/DMPO. |
| Open Datasets | Yes | Experimental results also indicate that using a mixture of deterministic policies in offline RL improves the performance with the D4RL benchmarking datasets. |
| Dataset Splits | Yes | A comparison between AWAC, mix AWAC, LP-AWAC, and DMPO is presented in Table 3. These methods incorporate importance weights based on the advantage function with different policy structures. [...] Average normalized scores over the last 10 test episodes and five seeds are shown. [...] We used the mujoco-v2, antmaze-v0, and adroit tasks on D4RL. |
| Hardware Specification | Yes | We used a workstation with GPU RTX A6000 and CPU Core i9-10980XE for this evaluation. |
| Software Dependencies | Yes | PyTorch 1.10.0, MuJoCo 2.1.0, mujoco-py 2.1.2.14 |
| Experiment Setup | Yes | For other hyperparameter details, please refer to Appendix F. [...] F Hyperparameters and implementation details [...] Tables 10–14 provide the hyperparameters used in the experiments. |