On the Role of Discount Factor in Offline Reinforcement Learning
Authors: Hao Hu, Yiqin Yang, Qianchuan Zhao, Chongjie Zhang
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify the above theoretical observation with tabular MDPs and standard D4RL tasks. The results show that the discount factor plays an essential role in the performance of offline RL algorithms, both under small data regimes upon existing offline methods and in large data regimes without other conservative methods. |
| Researcher Affiliation | Academia | 1Institute of Interdisciplinary Information Sciences, Tsinghua University, Beijing, China 2Department of Automation, Tsinghua University, Beijing, China. Correspondence to: Hao Hu <hu-h19@mails.tsinghua.edu.cn>, Yiqin Yang <yangyiqi19@mails.tsinghua.edu.cn>. |
| Pseudocode | Yes | Algorithm 1 Pessimistic Value Iteration; Algorithm 2 Model-Based Pessimistic Policy Optimization; Algorithm 3 Generalized Value Iteration (see the illustrative sketch after this table) |
| Open Source Code | No | The paper mentions using "author-provided implementation or the recognized code" for existing algorithms but does not provide specific links or statements about making its own code open-source. |
| Open Datasets | Yes | We empirically verify the two effects on both tabular MDPs and the standard D4RL benchmark (Fu et al., 2020). |
| Dataset Splits | No | The paper describes its training datasets but does not explicitly specify exact percentages or sample counts, nor refer to predefined splits for training, validation, and test sets. It focuses on the composition of the training data. |
| Hardware Specification | No | The paper describes running experiments but does not provide any specific details regarding the hardware (e.g., GPU models, CPU types, memory) used for these experiments. |
| Software Dependencies | No | The paper mentions using existing offline RL algorithms (e.g., TD3+BC, BCQ, COMBO, SAC) and states using author-provided or recognized code. However, it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions). |
| Experiment Setup | Yes | The noised trajectories are fragments of the random datasets in D4RL... The proportion of masked state-action pairs is 0.5, and the noise ratio coefficients are {4%, 6%, 8%, 12%}... Most scenarios in Table 2 adopt γ = 0.95 as a lower discount factor other than BCQ (γ = 0.9) and COMBO (γ = 0.9) in hopper tasks. We select γ = 0.95 in pen-expert-v0, hammer-expert-v0 and relocate-expert-v0 tasks. We select lower γ = 0.9 in door-expert-v0 tasks. |
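
As a companion to the Pseudocode row, the block below is a minimal, illustrative sketch of pessimistic value iteration on a tabular MDP, written only to show where the discount factor γ enters the Bellman backup. It is not the paper's Algorithm 1: the count-based penalty `beta / sqrt(N(s, a))`, the function name, and all hyperparameter defaults are assumptions made for illustration.

```python
# Illustrative sketch only -- not the paper's exact Algorithm 1.
# The count-based pessimism penalty and all defaults below are assumptions.
import numpy as np

def pessimistic_value_iteration(P_hat, r_hat, counts, gamma=0.95,
                                beta=1.0, n_iters=1000, tol=1e-8):
    """Value iteration on an estimated tabular model with a pessimism penalty.

    P_hat:  (S, A, S) transition probabilities estimated from the offline data
    r_hat:  (S, A)    estimated rewards
    counts: (S, A)    state-action visitation counts in the offline dataset
    gamma:  discount factor -- the quantity whose role the paper studies
    beta:   pessimism coefficient for the assumed penalty beta / sqrt(N(s, a))
    """
    penalty = beta / np.sqrt(np.maximum(counts, 1))  # larger where data is scarce
    V = np.zeros(P_hat.shape[0])
    for _ in range(n_iters):
        # Pessimistic backup: Q(s, a) = r(s, a) - b(s, a) + gamma * E[V(s')]
        Q = r_hat - penalty + gamma * (P_hat @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)  # pessimistic value estimate and greedy policy
```

In this sketch, lowering `gamma` (e.g., the γ = 0.9 or γ = 0.95 settings quoted in the Experiment Setup row, versus a conventional 0.99) shortens the effective planning horizon of the backup; that discount value is the lever the paper's tabular and D4RL experiments vary.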