Offline RL with No OOD Actions: In-Sample Learning via Implicit Value Regularization

Authors: Haoran Xu, Li Jiang, Jianxiong Li, Zhuoran Yang, Zhaoran Wang, Victor Wai Kin Chan, Xianyuan Zhan

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We present empirical evaluations of SQL and EQL in this section. We first evaluate SQL and EQL against other baseline algorithms on benchmark offline RL datasets."
Researcher Affiliation | Collaboration | (1) Institute for AI Industry Research (AIR), Tsinghua University; (2) Tsinghua-Berkeley Shenzhen Institute (TBSI), Tsinghua University; (3) Yale University; (4) Northwestern University; (5) Shanghai Artificial Intelligence Laboratory. *Work done while at JD Technology.
Pseudocode | Yes | "We summarize the training procedure in Algorithm 1."
Open Source Code | Yes | "Code is available at https://github.com/ryanxhr/IVR."
Open Datasets | Yes | "We first evaluate our approach on D4RL datasets (Fu et al., 2020)." (A loading sketch follows the table.)
Dataset Splits | No | The paper refers to using D4RL datasets and performing evaluations, but it does not explicitly state training, validation, and test split percentages or sample counts for reproduction.
Hardware Specification | No | The paper mentions implementing the method in JAX but does not specify the GPU, CPU, or cloud hardware used for the experiments.
Software Dependencies | No | The paper mentions using the Adam optimizer, JAX, and d3rlpy (Seno & Imai, 2021), but does not provide version numbers for these key software dependencies.
Experiment Setup | Yes | "In SQL and EQL, we use a 2-layer MLP with 256 hidden units, and we use the Adam optimizer (Kingma & Ba, 2015) with a learning rate of 2 × 10⁻⁴ for all neural networks. Following Mnih et al. (2013) and Lillicrap et al. (2016), we introduce a target critic network with soft update weight 5 × 10⁻³. [...] The only hyperparameter α used in SQL and EQL is listed in Table 5." (A configuration sketch follows the table.)
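
For the Open Datasets row: D4RL datasets are publicly available through the standard d4rl package. Below is a minimal loading sketch, assuming the usual Gym-based d4rl API; the task name is illustrative and not taken from the paper.

```python
# Minimal sketch: loading a D4RL benchmark dataset (Fu et al., 2020).
# Assumes the standard `d4rl` package is installed; the task name is illustrative.
import gym
import d4rl  # importing d4rl registers its environments with gym

env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of observations, actions, rewards, ...

print(dataset["observations"].shape, dataset["actions"].shape)
```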
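For the Experiment Setup row: the paper specifies 2-layer MLPs with 256 hidden units, Adam at 2 × 10⁻⁴, and a target critic with soft-update weight 5 × 10⁻³, implemented in JAX. The sketch below shows that configuration in JAX/Flax/Optax; the module and variable names are hypothetical and this is not the authors' released code.

```python
# Minimal sketch of the stated setup: 2-layer MLP critic (256 hidden units),
# Adam at lr 2e-4, and a Polyak-averaged target network with weight 5e-3.
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax

class Critic(nn.Module):
    @nn.compact
    def __call__(self, obs, act):
        x = jnp.concatenate([obs, act], axis=-1)
        x = nn.relu(nn.Dense(256)(x))  # hidden layer 1
        x = nn.relu(nn.Dense(256)(x))  # hidden layer 2
        return nn.Dense(1)(x)          # scalar Q-value

critic = Critic()
rng = jax.random.PRNGKey(0)
obs, act = jnp.zeros((1, 17)), jnp.zeros((1, 6))  # illustrative HalfCheetah shapes
params = critic.init(rng, obs, act)
target_params = params                            # target starts as a copy

opt = optax.adam(2e-4)                            # learning rate from the paper
opt_state = opt.init(params)

# Soft target update after each gradient step:
# target <- 5e-3 * online + (1 - 5e-3) * target
target_params = optax.incremental_update(params, target_params, 5e-3)
```

`optax.incremental_update` implements exactly the Polyak averaging described: each call blends the online parameters into the target parameters with the given step size.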