Model-based Offline Reinforcement Learning with Count-based Conservatism

Authors: Byeongchan Kim, Min-Hwan Oh

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive numerical experiments, we validate that Count-MORL with hash code implementation significantly outperforms existing offline RL algorithms on the D4RL benchmark datasets. The code is accessible at https://github.com/oh-lab/Count-MORL.
Researcher Affiliation | Academia | Seoul National University, Seoul, South Korea. Correspondence to: Min-hwan Oh <minoh@snu.ac.kr>.
Pseudocode | Yes | Algorithm 1: Count-based Conservatism for Model-based Offline RL (Count-MORL); Algorithm 2: Count Estimation
Open Source Code | Yes | The code is accessible at https://github.com/oh-lab/Count-MORL.
Open Datasets | Yes | MuJoCo. We evaluate Count-MORL on datasets in the D4RL benchmark (Fu et al., 2020), which comprises a total of 12 datasets from 3 different environments (HalfCheetah, Hopper, and Walker2d), each with 4 dataset types (Random, Medium, Medium-Replay, Medium-Expert).
Dataset Splits | No | The paper mentions a validation prediction error on a held-out test set of 1000 transitions, but does not provide explicit training/validation/test splits in the general experimental setup. The held-out set is used for model selection, effectively acting as a validation set; no overall train/val/test splits are given for the experiments.
Hardware Specification | No | The paper does not specify the hardware used to run the experiments (e.g., CPU or GPU models, or memory).
Software Dependencies | No | The paper mentions using Soft Actor-Critic (SAC) and the MOPO framework, but does not specify software versions (e.g., Python, PyTorch, or CUDA).
Experiment Setup | Yes | Our algorithm adopts its foundational hyperparameters from the MOPO framework (Yu et al., 2020). While MOPO typically utilizes relatively short rollout lengths, such as 2 or 5 steps, recent research (Lu et al., 2022) emphasizes the crucial role that the rollout length parameter, denoted by h, plays in the performance of model-based offline RL algorithms. Therefore, we extend the rollout length to accommodate up to 20 steps. We select an optimal rollout length and a reward penalty coefficient from the following potential values: h ∈ {5, 20} and β ∈ {0.5, 1, 3, 5}. We identify the hyperparameters that have a significant influence on the performance of Count-MORL: the rollout length (H), the standard deviation coefficient (α) for count estimation, the reward penalty coefficient (β) for each count estimation method, and the dimension of the hash codes (d). For the rollout length and the reward penalty coefficient, we found that H ∈ {5, 20} and β ∈ {0.5, 1, 3, 5} performed well across all datasets. This is a slight modification of the values H ∈ {1, 5} and β ∈ {0.5, 1, 5} used in previous model-based offline RL algorithms (Yu et al., 2020). Moreover, Lu et al. (2022) show that the rollout length and the reward penalty coefficient are key parameters in determining the performance of model-based offline RL algorithms; thus, we utilize a longer rollout length of up to 20 steps. We fix the standard deviation coefficient α at 0.5. Depending on the dataset, we choose the hash code dimension d from {16, 32, 50, 64, 80}.
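
The Open Datasets and Dataset Splits rows above reference the D4RL benchmark and a held-out set of 1000 transitions used for dynamics-model validation. The following is a minimal sketch of how such a split could be produced with the public d4rl package; the paper does not specify how the held-out transitions are selected, so the random split below is an assumption.

```python
# Hedged sketch: load a D4RL dataset and hold out 1000 transitions for
# dynamics-model validation. The random-split logic is an assumption; the
# paper does not state how its held-out test set is chosen.
import gym
import d4rl  # importing d4rl registers the benchmark environments with gym
import numpy as np

def load_with_holdout(env_name: str = "halfcheetah-medium-v2",
                      holdout_size: int = 1000,
                      seed: int = 0):
    env = gym.make(env_name)
    data = d4rl.qlearning_dataset(env)  # observations, actions, rewards, next_observations, terminals

    n = data["observations"].shape[0]
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    holdout_idx, train_idx = idx[:holdout_size], idx[holdout_size:]

    train = {k: v[train_idx] for k, v in data.items()}
    holdout = {k: v[holdout_idx] for k, v in data.items()}
    return train, holdout
```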
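
The Pseudocode row names Algorithm 2 (Count Estimation), and the Experiment Setup row quotes a hash code dimension d. Below is a minimal SimHash-style counting sketch over state-action pairs, included only to illustrate the general hash-code counting idea; the paper's actual count estimation methods (and any learned hash features) may differ, and the class and method names here are hypothetical.

```python
# Hedged sketch of SimHash-style count estimation for state-action pairs.
# Random projections map (s, a) to a d-bit binary code, and counts are
# accumulated per code. This is an illustration, not the paper's Algorithm 2.
from collections import defaultdict
import numpy as np

class HashCounter:
    def __init__(self, input_dim: int, code_dim: int = 32, seed: int = 0):
        # input_dim = state_dim + action_dim
        rng = np.random.default_rng(seed)
        self.proj = rng.standard_normal((code_dim, input_dim))  # random projection matrix
        self.counts = defaultdict(int)

    def _code(self, state: np.ndarray, action: np.ndarray) -> bytes:
        x = np.concatenate([state, action])
        bits = (self.proj @ x) > 0          # sign pattern = d-bit hash code
        return np.packbits(bits).tobytes()  # hashable dictionary key

    def update(self, state: np.ndarray, action: np.ndarray) -> None:
        self.counts[self._code(state, action)] += 1

    def count(self, state: np.ndarray, action: np.ndarray) -> int:
        return self.counts[self._code(state, action)]
```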
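
Finally, the Experiment Setup row lists the hyperparameter search space (H, β, α, d). The sketch below simply collects those quoted values into a configuration grid; the per-dataset selections reported in the paper are not reproduced here, and the dictionary keys are hypothetical names.

```python
# Hedged sketch of the hyperparameter search space quoted in the Experiment
# Setup row. Per-dataset choices are not reproduced here.
from itertools import product

COUNT_MORL_SEARCH_SPACE = {
    "rollout_length_H": [5, 20],            # extended beyond MOPO's shorter rollouts
    "reward_penalty_beta": [0.5, 1, 3, 5],
    "std_coefficient_alpha": [0.5],         # fixed at 0.5 in the paper
    "hash_code_dim_d": [16, 32, 50, 64, 80],
}

def iter_configs(space):
    """Yield every combination of hyperparameters in the search space."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))
```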