Model-based Offline Reinforcement Learning with Count-based Conservatism

Authors: Byeongchan Kim, Min-Hwan Oh

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive numerical experiments, we validate that Count-MORL with hash code implementation significantly outperforms existing offline RL algorithms on the D4RL benchmark datasets. The code is accessible at https://github.com/oh-lab/Count-MORL.
Researcher Affiliation | Academia | Seoul National University, Seoul, South Korea. Correspondence to: Min-hwan Oh <minoh@snu.ac.kr>.
Pseudocode | Yes | Algorithm 1: Count-based Conservatism for Model-based Offline RL (Count-MORL); Algorithm 2: Count Estimation
Open Source Code | Yes | The code is accessible at https://github.com/oh-lab/Count-MORL.
Open Datasets | Yes | MuJoCo. We evaluate Count-MORL on datasets in the D4RL benchmark (Fu et al., 2020), which comprises a total of 12 datasets from 3 different environments (HalfCheetah, Hopper, and Walker2d), each with 4 dataset types (Random, Medium, Medium-Replay, Medium-Expert).
Dataset Splits | No | The paper mentions a validation prediction error on a held-out test set of 1000 transitions, but does not provide explicit training/validation/test splits in the general experimental setup. The held-out set is used for model selection, effectively acting as a validation set; no overall train/val/test splits are given for the experiments.
Hardware Specification | No | The paper does not specify the hardware used to run the experiments (e.g., CPU or GPU models, or memory).
Software Dependencies | No | The paper mentions using Soft Actor-Critic (SAC) and the MOPO framework, but does not specify software versions (e.g., Python, PyTorch, or CUDA).
Experiment Setup | Yes | Our algorithm adopts its foundational hyperparameters from the MOPO framework (Yu et al., 2020). While MOPO typically utilizes relatively short rollout lengths, such as 2 or 5 steps, recent research (Lu et al., 2022) emphasizes the crucial role that the rollout length parameter, denoted by h, plays in the performance of model-based offline RL algorithms. Therefore, we extend the rollout length to accommodate up to 20 steps. We select an optimal rollout length and a reward penalty coefficient from the following potential values: h ∈ {5, 20} and β ∈ {0.5, 1, 3, 5}. We identify the hyperparameters that have a significant influence on the performance of Count-MORL: the rollout length (H), the standard deviation coefficient (α) for count estimation, the reward penalty coefficient (β) for each count estimation method, and the dimension of the hash codes (d). For the rollout length and the reward penalty coefficient, we found that H ∈ {5, 20} and β ∈ {0.5, 1, 3, 5} performed well across all datasets. This is a slight modification of the values H ∈ {1, 5} and β ∈ {0.5, 1, 5} used in previous model-based offline RL algorithms (Yu et al., 2020). Moreover, Lu et al. (2022) show that the rollout length and the reward penalty coefficient are key parameters in determining the performance of model-based offline RL algorithms; thus, we utilize a longer rollout length of up to 20 steps. We fix the standard deviation coefficient α at 0.5. Depending on the dataset, we choose the hash code dimension d from {16, 32, 50, 64, 80}.
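
The Open Datasets and Dataset Splits rows above reference the D4RL benchmark and a held-out set of 1000 transitions used for dynamics-model validation. The following is a minimal sketch of how such a split could be produced with the public d4rl package; the paper does not specify how the held-out transitions are selected, so the random split below is an assumption.

```python
# Hedged sketch: load a D4RL dataset and hold out 1000 transitions for
# dynamics-model validation. The random-split logic is an assumption; the
# paper does not state how its held-out test set is chosen.
import gym
import d4rl  # importing d4rl registers the benchmark environments with gym
import numpy as np

def load_with_holdout(env_name: str = "halfcheetah-medium-v2",
                      holdout_size: int = 1000,
                      seed: int = 0):
    env = gym.make(env_name)
    data = d4rl.qlearning_dataset(env)  # observations, actions, rewards, next_observations, terminals

    n = data["observations"].shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    holdout_idx, train_idx = idx[:holdout_size], idx[holdout_size:]

    train = {k: v[train_idx] for k, v in data.items()}
    holdout = {k: v[holdout_idx] for k, v in data.items()}
    return train, holdout
```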
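
The Pseudocode row names Algorithm 2 (Count Estimation), and the Experiment Setup row quotes a hash code dimension d. Below is a minimal SimHash-style counting sketch over state-action pairs, included only to illustrate the general hash-code counting idea; the paper's actual count estimation methods (and any learned hash features) may differ, and the class and method names here are hypothetical.

```python
# Hedged sketch of SimHash-style count estimation for state-action pairs.
# Random projections map (s, a) to a d-bit binary code, and counts are
# accumulated per code. This is an illustration, not the paper's Algorithm 2.
from collections import defaultdict
import numpy as np

class HashCounter:
    def __init__(self, input_dim: int, code_dim: int = 32, seed: int = 0):
        # input_dim = state_dim + action_dim
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((code_dim, input_dim))  # random projection matrix
        self.counts = defaultdict(int)

    def _code(self, state: np.ndarray, action: np.ndarray) -> bytes:
        x = np.concatenate([state, action])
        bits = (self.proj @ x) > 0          # sign pattern = d-bit hash code
        return np.packbits(bits).tobytes()  # hashable dictionary key

    def update(self, state: np.ndarray, action: np.ndarray) -> None:
        self.counts[self._code(state, action)] += 1

    def count(self, state: np.ndarray, action: np.ndarray) -> int:
        return self.counts[self._code(state, action)]
```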
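
Finally, the Experiment Setup row lists the hyperparameter search space (H, β, α, d). The sketch below simply collects those quoted values into a configuration grid; the per-dataset selections reported in the paper are not reproduced here, and the dictionary keys are hypothetical names.

```python
# Hedged sketch of the hyperparameter search space quoted in the Experiment
# Setup row. Per-dataset choices are not reproduced here.
from itertools import product

COUNT_MORL_SEARCH_SPACE = {
    "rollout_length_H": [5, 20],            # extended beyond MOPO's shorter rollouts
    "reward_penalty_beta": [0.5, 1, 3, 5],
    "std_coefficient_alpha": [0.5],         # fixed at 0.5 in the paper
    "hash_code_dim_d": [16, 32, 50, 64, 80],
}

def iter_configs(space):
    """Yield every combination of hyperparameters in the search space."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))
```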