Model-based Offline Reinforcement Learning with Count-based Conservatism
Authors: Byeongchan Kim, Min-Hwan Oh
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive numerical experiments, we validate that Count-MORL with hash code implementation significantly outperforms existing offline RL algorithms on the D4RL benchmark datasets. The code is accessible at https://github.com/oh-lab/Count-MORL. |
| Researcher Affiliation | Academia | 1Seoul National University, Seoul, South Korea. Correspondence to: Min-hwan Oh <minoh@snu.ac.kr>. |
| Pseudocode | Yes | Algorithm 1: Count-based Conservatism for Model-based Offline RL (Count-MORL); Algorithm 2: Count Estimation (see the count-estimation sketch after this table) |
| Open Source Code | Yes | The code is accessible at https://github.com/oh-lab/Count-MORL. |
| Open Datasets | Yes | MuJoCo. We evaluate Count-MORL on datasets in the D4RL benchmark (Fu et al., 2020), which comprises a total of 12 datasets from 3 different environments (HalfCheetah, Hopper, and Walker2d), each with 4 dataset types (Random, Medium, Medium-Replay, Medium-Expert). (A dataset-loading sketch follows this table.) |
| Dataset Splits | No | The paper reports a validation prediction error on a held-out test set of 1000 transitions used for model selection, which effectively acts as a validation set, but it does not give explicit train/validation/test splits for the overall experiments. |
| Hardware Specification | No | The paper does not specify the hardware used (e.g., CPU, GPU models, or memory) for running the experiments. |
| Software Dependencies | No | The paper mentions using Soft Actor-Critic (SAC) and MOPO framework but does not specify software versions (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | Our algorithm adopts its foundational hyperparameters from the MOPO framework (Yu et al., 2020). While MOPO typically utilizes relatively short rollout lengths, such as 2 or 5 steps, recent research (Lu et al., 2022) emphasizes the crucial role the rollout length parameter, denoted by h, plays in the performance of model-based offline RL algorithms. Therefore, we extend the rollout length to accommodate up to 20 steps. We select an optimal rollout length and a reward penalty coefficient from the following potential values: h ∈ {5, 20} and β ∈ {0.5, 1, 3, 5}. The hyperparameters with a significant influence on the performance of Count-MORL are the rollout length (H), a standard deviation coefficient (α) for the count estimation, a reward penalty coefficient (β) for each count estimation method, and the dimension of the hash codes (d). For the rollout length and the reward penalty coefficient, we found that H ∈ {5, 20} and β ∈ {0.5, 1, 3, 5} performed well across all datasets, a slight modification of the values H ∈ {1, 5} and β ∈ {0.5, 1, 5} used in previous model-based offline RL algorithms (Yu et al., 2020). Lu et al. (2022) show that the rollout length and the reward penalty coefficient are key parameters in determining the performance of model-based offline RL algorithms, so we allow rollout lengths of up to 20 steps. We fix the standard deviation coefficient α at 0.5. Depending on the dataset, we choose the hash code dimension d from {16, 32, 50, 64, 80}. (A hyperparameter-grid sketch follows this table.) |
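For context on the Pseudocode row: Algorithm 2 of the paper estimates counts for state-action pairs via hash codes, and the conservative reward penalty in Algorithm 1 scales with those counts. The authors' exact implementation is in the linked repository; the sketch below is only a minimal illustration that assumes a SimHash-style random projection and a 1/√(n+1) penalty form, both of which are assumptions rather than the paper's specification.

```python
import numpy as np

class SimHashCounter:
    """Hash-based counter over (state, action) pairs (illustrative only).

    Projects the concatenated state-action vector with a fixed random
    Gaussian matrix, takes the sign pattern as a d-bit hash code, and
    counts how often each code appears in the offline dataset.
    """

    def __init__(self, input_dim, code_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        self.projection = rng.standard_normal((code_dim, input_dim))
        self.counts = {}

    def _code(self, state, action):
        x = np.concatenate([state, action])
        bits = (self.projection @ x > 0).astype(np.uint8)
        return bits.tobytes()  # hashable key for the count dictionary

    def update(self, state, action):
        key = self._code(state, action)
        self.counts[key] = self.counts.get(key, 0) + 1

    def count(self, state, action):
        return self.counts.get(self._code(state, action), 0)


def penalized_reward(r, counter, state, action, beta=1.0):
    # Hypothetical count-based penalty: shrink the model reward more
    # when the (state, action) pair was rarely seen in the dataset.
    n = counter.count(state, action)
    return r - beta / np.sqrt(n + 1)
```

In practice the counter would be filled once from the offline dataset and `penalized_reward` applied to model-generated transitions during rollouts; the exact penalty form should be taken from the paper and the official repository.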
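For the Open Datasets row: the D4RL datasets are publicly available through the `d4rl` package. A minimal loading sketch follows, assuming the standard Gym/D4RL API; the specific dataset name and version suffix are illustrative, and the paper's repository pins the exact environments used.

```python
import gym
import d4rl  # registers the D4RL offline environments with gym

# Example environment/dataset; the version suffix is an assumption.
env = gym.make("halfcheetah-medium-v2")

# Dictionary with 'observations', 'actions', 'rewards', 'terminals', ...
dataset = d4rl.qlearning_dataset(env)
print(dataset["observations"].shape, dataset["actions"].shape)
```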
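For the Experiment Setup row: the quoted hyperparameter ranges amount to a small per-dataset search grid. The layout below is an illustrative summary, not the authors' configuration format.

```python
# Hyperparameter search space reported for Count-MORL (illustrative layout).
COUNT_MORL_GRID = {
    "rollout_length_H": [5, 20],               # extended relative to MOPO
    "reward_penalty_beta": [0.5, 1, 3, 5],
    "std_coefficient_alpha": [0.5],            # fixed across datasets
    "hash_code_dim_d": [16, 32, 50, 64, 80],   # chosen per dataset
}

def grid_size(grid):
    size = 1
    for values in grid.values():
        size *= len(values)
    return size

print(grid_size(COUNT_MORL_GRID))  # 2 * 4 * 1 * 5 = 40 combinations
```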