Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Offline Reinforcement Learning with Mixture of Deterministic Policies

Authors: Takayuki Osa, Akinobu Hayashi, Pranav Deo, Naoki Morihira, Takahide Yoshiike

TMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show empirically that the use of the proposed mixture policy can reduce the accumulation of the critic loss in offline RL, which was reported in previous studies. Experimental results also indicate that using a mixture of deterministic policies in offline RL improves the performance with the D4RL benchmarking datasets. [...] We empirically show that the use of a mixture of deterministic policies can reduce the accumulation of the approximation error in offline RL. [...] Through experiments with benchmark tasks in D4RL (Fu et al., 2020), we demonstrate that the proposed algorithms are competitive with prevalent offline RL methods."
Researcher Affiliation | Collaboration | Takayuki Osa (EMAIL), The University of Tokyo and RIKEN; Akinobu Hayashi (EMAIL), Honda R&D Co., Ltd.; Pranav Deo (EMAIL), Honda R&D Co., Ltd.; Naoki Morihira (EMAIL), Honda R&D Co., Ltd.; Takahide Yoshiike (EMAIL), Honda R&D Co., Ltd.
Pseudocode | Yes | "Algorithm 1 Deterministic mixture policy optimization (DMPO)"
Open Source Code | Yes | "Our implementation is available at https://github.com/TakaOsa/DMPO."
Open Datasets | Yes | "Experimental results also indicate that using a mixture of deterministic policies in offline RL improves the performance with the D4RL benchmarking datasets."
Dataset Splits | Yes | "A comparison between AWAC, mix AWAC, LP-AWAC, and DMPO is presented in Table 3. These methods incorporate importance weights based on the advantage function with different policy structures. [...] Average normalized scores over the last 10 test episodes and five seeds are shown. [...] We used the mujoco-v2, antmaze-v0, and adroit tasks on D4RL."
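The "average normalized scores" quoted above follow the standard D4RL convention: a raw episode return is rescaled so that a random policy scores 0 and an expert policy scores 100, then averaged over the last 10 test episodes and five seeds. A minimal sketch of that computation follows; the reference-return arguments are illustrative placeholders, not the official D4RL per-task constants, and `average_score` is a hypothetical helper named here for clarity.

```python
def normalized_score(episode_return: float,
                     random_return: float,
                     expert_return: float) -> float:
    """Map a raw episode return onto the 0-100 D4RL scale.

    0 corresponds to the random-policy return, 100 to the expert return.
    """
    return 100.0 * (episode_return - random_return) / (expert_return - random_return)


def average_score(returns_per_seed, random_return, expert_return):
    """Mean normalized score over the last 10 test episodes of each seed,
    mirroring how the paper's tables aggregate results."""
    scores = [normalized_score(r, random_return, expert_return)
              for seed_returns in returns_per_seed
              for r in seed_returns[-10:]]  # last 10 test episodes per seed
    return sum(scores) / len(scores)
```

In practice the D4RL package exposes this rescaling directly per task, so the placeholder reference returns would come from its published tables rather than being hand-chosen.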
Hardware Specification | Yes | "We used a workstation with GPU RTX A6000 and CPU Core i9-10980XE for this evaluation."
Software Dependencies | Yes | "PyTorch 1.10.0; MuJoCo 2.1.0; mujoco-py 2.1.2.14"
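For a reproduction attempt, the reported versions translate into pins along these lines. This is a sketch assuming a pip-based setup; MuJoCo 2.1.0 itself is a separate binary install that mujoco-py links against, and the paper's repository may prescribe its own installation procedure.

```
torch==1.10.0
mujoco-py==2.1.2.14
```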
Experiment Setup | Yes | "For other hyperparameter details, please refer to the Appendix F. [...] F Hyperparameters and implementation details [...] Tables 10–14 provide the hyperparameters used in the experiments."