Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Offline Reinforcement Learning with Mixture of Deterministic Policies
Authors: Takayuki Osa, Akinobu Hayashi, Pranav Deo, Naoki Morihira, Takahide Yoshiike
TMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show empirically that the use of the proposed mixture policy can reduce the accumulation of the critic loss in offline RL, which was reported in previous studies. Experimental results also indicate that using a mixture of deterministic policies in offline RL improves the performance with the D4RL benchmarking datasets. [...] We empirically show that the use of a mixture of deterministic policies can reduce the accumulation of the approximation error in offline RL. [...] Through experiments with benchmark tasks in D4RL (Fu et al., 2020), we demonstrate that the proposed algorithms are competitive with prevalent offline RL methods. |
| Researcher Affiliation | Collaboration | Takayuki Osa (The University of Tokyo; RIKEN), Akinobu Hayashi (Honda R&D Co., Ltd.), Pranav Deo (Honda R&D Co., Ltd.), Naoki Morihira (Honda R&D Co., Ltd.), Takahide Yoshiike (Honda R&D Co., Ltd.) |
| Pseudocode | Yes | Algorithm 1 Deterministic mixture policy optimization (DMPO) |
| Open Source Code | Yes | Our implementation is available at https://github.com/TakaOsa/DMPO. |
| Open Datasets | Yes | Experimental results also indicate that using a mixture of deterministic policies in offline RL improves the performance with the D4RL benchmarking datasets. |
| Dataset Splits | Yes | A comparison between AWAC, mix AWAC, LP-AWAC, and DMPO is presented in Table 3. These methods incorporate importance weights based on the advantage function with different policy structures. [...] Average normalized scores over the last 10 test episodes and five seeds are shown. [...] We used the mujoco-v2, antmaze-v0, and adroit tasks on D4RL. |
| Hardware Specification | Yes | We used a workstation with GPU RTX A6000 and CPU Core i9-10980XE for this evaluation. |
| Software Dependencies | Yes | PyTorch 1.10.0, MuJoCo 2.1.0, mujoco-py 2.1.2.14 |
| Experiment Setup | Yes | For other hyperparameter details, please refer to Appendix F. [...] F Hyperparameters and implementation details [...] Tables 10–14 provide the hyperparameters used in the experiments. |