On the Role of Discount Factor in Offline Reinforcement Learning
Authors: Hao Hu, Yiqin Yang, Qianchuan Zhao, Chongjie Zhang
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically verify the above theoretical observation with tabular MDPs and standard D4RL tasks. The results show that the discount factor plays an essential role in the performance of offline RL algorithms, both under small data regimes upon existing offline methods and in large data regimes without other conservative methods. |
| Researcher Affiliation | Academia | 1Institute of Interdisciplinary Information Sciences, Tsinghua University, Beijing, China 2Department of Automation, Tsinghua University, Beijing, China. Correspondence to: Hao Hu <hu-h19@mails.tsinghua.edu.cn>, Yiqin Yang <yangyiqi19@mails.tsinghua.edu.cn>. |
| Pseudocode | Yes | Algorithm 1 Pessimistic Value Iteration; Algorithm 2 Model-Based Pessimistic Policy Optimization; Algorithm 3 Generalized Value Iteration (see the illustrative sketch after this table) |
| Open Source Code | No | The paper mentions using "author-provided implementation or the recognized code" for existing algorithms but does not provide specific links or statements about making its own code open-source. |
| Open Datasets | Yes | We empirically verify the two effects on both tabular MDPs and the standard D4RL benchmark (Fu et al., 2020). |
| Dataset Splits | No | The paper describes its training datasets but does not explicitly specify exact percentages or sample counts, nor refer to predefined splits for training, validation, and test sets. It focuses on the composition of the training data. |
| Hardware Specification | No | The paper describes running experiments but does not provide any specific details regarding the hardware (e.g., GPU models, CPU types, memory) used for these experiments. |
| Software Dependencies | No | The paper mentions using existing offline RL algorithms (e.g., TD3+BC, BCQ, COMBO, SAC) and states using author-provided or recognized code. However, it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow versions, or specific library versions). |
| Experiment Setup | Yes | The noised trajectories are fragments of the random datasets in D4RL... The proportion of masked state-action pairs is 0.5, and the noise ratio coefficients are {4%, 6%, 8%, 12%}... Most scenarios in Table 2 adopt γ = 0.95 as a lower discount factor other than BCQ (γ = 0.9) and COMBO (γ = 0.9) in hopper tasks. We select γ = 0.95 in pen-expert-v0, hammer-expert-v0 and relocate-expert-v0 tasks. We select lower γ = 0.9 in door-expert-v0 tasks. |
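
As a companion to the Pseudocode row, the block below is a minimal, illustrative sketch of pessimistic value iteration on a tabular MDP, written only to show where the discount factor γ enters the Bellman backup. It is not the paper's Algorithm 1: the count-based penalty `beta / sqrt(N(s, a))`, the function name, and all hyperparameter defaults are assumptions made for illustration.

```python
# Illustrative sketch only -- not the paper's exact Algorithm 1.
# The count-based pessimism penalty and all defaults below are assumptions.
import numpy as np

def pessimistic_value_iteration(P_hat, r_hat, counts, gamma=0.95,
                                beta=1.0, n_iters=1000, tol=1e-8):
    """Value iteration on an estimated tabular model with a pessimism penalty.

    P_hat:  (S, A, S) transition probabilities estimated from the offline data
    r_hat:  (S, A)    estimated rewards
    counts: (S, A)    state-action visitation counts in the offline dataset
    gamma:  discount factor -- the quantity whose role the paper studies
    beta:   pessimism coefficient for the assumed penalty beta / sqrt(N(s, a))
    """
    penalty = beta / np.sqrt(np.maximum(counts, 1))  # larger where data is scarce
    V = np.zeros(P_hat.shape[0])
    for _ in range(n_iters):
        # Pessimistic backup: Q(s, a) = r(s, a) - b(s, a) + gamma * E[V(s')]
        Q = r_hat - penalty + gamma * (P_hat @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)  # pessimistic value estimate and greedy policy
```

In this sketch, lowering `gamma` (e.g., the γ = 0.9 or γ = 0.95 settings quoted in the Experiment Setup row, versus a conventional 0.99) shortens the effective planning horizon of the backup; that discount value is the lever the paper's tabular and D4RL experiments vary.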