Exclusively Penalized Q-learning for Offline Reinforcement Learning

Authors: Junghyuk Yeom, Yonghyeon Jo, Jungmo Kim, Sanghyeon Lee, Seungyul Han

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical results show that our method significantly reduces underestimation bias and improves performance in various offline control tasks compared to other offline RL methods. Numerical results reveal that EPQ significantly outperforms other state-of-the-art offline RL algorithms on various D4RL tasks [23].
Researcher Affiliation | Academia | Junghyuk Yeom, Yonghyeon Jo, Jungmo Kim, Sanghyeon Lee, Seungyul Han; Graduate School of Artificial Intelligence, UNIST, Ulsan, South Korea 44919; {junghyukyum,yonghyeonjo,jmkim22,sanghyeon,syhan}@unist.ac.kr
Pseudocode | Yes | Algorithm 1: Exclusively Penalized Q-learning (an illustrative sketch of the penalty idea appears below the table)
Open Source Code | Yes | The data and code for reproducing the main experimental results are included in supplemental materials.
Open Datasets | Yes | In this section, we evaluate our proposed EPQ against other state-of-the-art offline RL algorithms using the D4RL benchmark [23], commonly used in the offline RL domain. (A dataset-loading sketch appears below the table.)
Dataset Splits | No | The paper mentions D4RL tasks and various datasets (e.g., MuJoCo, Adroit, AntMaze) but does not provide explicit details on train/validation/test splits, such as percentages or sample counts for each split.
Hardware Specification | Yes | We conduct our experiments on a single server equipped with an Intel Xeon Gold 6336Y CPU and one NVIDIA RTX A5000 GPU.
Software Dependencies | No | The paper mentions using a VAE [53] for behavior policy estimation, the nearest neighbor (NN) algorithm [54] from a Python library, and concepts from DQN [52] and Soft Actor-Critic [4, 56], but it does not specify version numbers for any software, libraries, or frameworks used.
Experiment Setup | Yes | First, we provide the details of the shared algorithm hyperparameters in Table 4. In Table 4, we compare the shared algorithm hyperparameters of CQL, the revised version of CQL (revised), and the proposed EPQ. [...] In addition, in Table 5, we provide the details of the task hyperparameters regarding our contributions in the proposed EPQ: the penalty control threshold τ and the IS clipping factor c_min in the Q-loss implementation in (B.2), and the cluster radius ϵ and regularizing temperature ζ for the practical implementation of the IS clipping factor w^Q_{s,a} in Section B.4. (A placeholder configuration sketch appears below the table.)
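
The Pseudocode row points to Algorithm 1 (Exclusively Penalized Q-learning); the paper's actual algorithm is not reproduced here. The snippet below is only a minimal sketch of the general idea suggested by the row text: a conservative (CQL-style) Q-loss whose penalty is applied selectively to poorly covered samples rather than uniformly. The names `selective_penalty_q_loss`, `behavior_density`, and the threshold `tau` are hypothetical placeholders, not the authors' implementation.

```python
# Illustrative sketch only: a CQL-style Q-loss whose conservative penalty is
# masked per sample, loosely mirroring the "exclusive penalty" idea named in
# Algorithm 1. All names and default values here are hypothetical.
import torch


def selective_penalty_q_loss(q_net, states, actions, td_targets,
                             sampled_actions, behavior_density,
                             tau=1.0, alpha=5.0):
    """Bellman error plus a penalty applied only where the dataset action
    appears insufficiently covered by the behavior policy (density < tau)."""
    q_data = q_net(states, actions)                      # Q(s, a) on dataset pairs
    bellman_loss = ((q_data - td_targets) ** 2).mean()   # standard TD regression

    # log-sum-exp over candidate actions approximates max_a Q(s, a)
    q_sampled = torch.stack([q_net(states, a) for a in sampled_actions], dim=0)
    penalty = torch.logsumexp(q_sampled, dim=0) - q_data  # per-sample gap

    # "Exclusive" part (sketch): penalize only samples whose estimated behavior
    # density falls below the threshold tau; well-covered samples are untouched.
    mask = (behavior_density < tau).float()
    return bellman_loss + alpha * (mask * penalty).mean()
```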
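The Open Datasets row cites the D4RL benchmark [23]. Assuming the standard `gym` and `d4rl` packages, a dataset from one of the task families mentioned in the report (MuJoCo locomotion, Adroit, AntMaze) can be loaded as follows; the specific task name is only an example.

```python
# Minimal sketch of pulling a D4RL dataset; the task name is an example only.
import gym
import d4rl  # importing d4rl registers the D4RL environments with gym

env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # dict with observations, actions, rewards, ...

print({key: value.shape for key, value in dataset.items()})
```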
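The Experiment Setup row lists four task-level hyperparameters from the paper's Table 5 (τ, c_min, ϵ, ζ). As a sketch of how such a per-task configuration might be organized, the dataclass below uses placeholder values; the actual settings are given in Table 5 and the supplemental code, not here.

```python
# Hypothetical per-task hyperparameter record mirroring the quantities named in
# the Experiment Setup row. Values are placeholders, not the paper's settings.
from dataclasses import dataclass


@dataclass
class EPQTaskConfig:
    task: str
    tau: float      # penalty control threshold
    c_min: float    # importance-sampling clipping factor (Q-loss, Eq. B.2)
    epsilon: float  # cluster radius for the practical w^Q_{s,a} estimate
    zeta: float     # regularizing temperature


# Example usage with placeholder numbers only.
config = EPQTaskConfig(task="halfcheetah-medium-v2",
                       tau=1.0, c_min=0.1, epsilon=0.3, zeta=1.0)
print(config)
```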