Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL
Authors: Yang Yue, Rui Lu, Bingyi Kang, Shiji Song, Gao Huang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experiments demonstrate perfect alignment with this theoretical analysis. Building on our insights, we propose to resolve divergence from a novel perspective, namely regularizing the neural network's generalization behavior. Through extensive empirical studies, we identify LayerNorm as a good solution to effectively avoid divergence without introducing detrimental bias, leading to superior performance. Experimental results prove that it can still work in some of the most challenging settings, i.e., using only 1% of the transitions in the dataset, where all previous methods fail. (See the LayerNorm sketch below the table.) |
| Researcher Affiliation | Collaboration | 1. Department of Automation, BNRist, Tsinghua University; 2. ByteDance Inc. |
| Pseudocode | No | The paper describes methods and equations (e.g., Equation 1 for Q-value iteration), but no section or block is explicitly labeled 'Pseudocode' or 'Algorithm', nor are steps presented in a code-like structured format. |
| Open Source Code | Yes | Code can be found at https://offrl-seem.github.io. |
| Open Datasets | Yes | Our experiments are conducted on a widely used offline RL benchmark D4RL [10]... Previous state-of-the-art offline RL algorithms have performed exceptionally well on D4RL Mujoco Locomotion tasks... We construct transition-based datasets by randomly sampling varying proportions (X%) from the D4RL Mujoco Locomotion datasets. Here, we set several levels for X ∈ {1, 10, 50, 100}. (A subsampling sketch follows the table.) |
| Dataset Splits | No | The paper mentions using D4RL datasets (e.g., 'walker2d-medium-expert-v2', 'Antmaze') for training and evaluation. While D4RL datasets typically have predefined splits, the paper does not explicitly state the specific train/validation/test percentages or sample counts used for its experiments. The term 'validation' is used in the context of validating theoretical findings, not dataset splits. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running its experiments. |
| Software Dependencies | No | The paper mentions using the Adam optimizer and building upon existing frameworks like TD3+BC and JAX implementations, but it does not provide specific version numbers for any programming languages, libraries, or software dependencies (e.g., Python version, PyTorch/TensorFlow version, CUDA version). |
| Experiment Setup | Yes | For the experiments presented in Section 3.1, we adopted TD3 as our baseline, but with a modification: instead of using an exponential moving average (EMA), we directly copied the current Q-network as the target network. The Adam optimizer was used with a learning rate of 0.0003, β1 = 0.9, and β2 = 0.999. The discount factor, γ, was set to 0.99. (A configuration sketch follows the table.) |
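As a reading aid, here is a minimal sketch of the LayerNorm fix referenced in the Research Type row. This is not the authors' released code (that lives at https://offrl-seem.github.io); the two-hidden-layer structure and layer sizes are illustrative assumptions. The core idea is simply normalizing each hidden layer's pre-activations inside the Q-network.

```python
# Hypothetical sketch: a Q-network with LayerNorm after each hidden layer,
# illustrating the paper's proposed regularization. The architecture details
# here are assumptions, not the authors' exact network.
import jax.numpy as jnp
import flax.linen as nn

class LayerNormQNetwork(nn.Module):
    hidden_dim: int = 256
    num_hidden: int = 2

    @nn.compact
    def __call__(self, obs, action):
        x = jnp.concatenate([obs, action], axis=-1)
        for _ in range(self.num_hidden):
            x = nn.Dense(self.hidden_dim)(x)
            x = nn.LayerNorm()(x)  # the key addition: normalize before the nonlinearity
            x = nn.relu(x)
        return nn.Dense(1)(x)      # scalar Q-value estimate
```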
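The Open Datasets row describes building transition-based subsets by sampling X% of each D4RL dataset. Below is a hedged sketch of that subsampling, assuming the dict-of-arrays format returned by d4rl's `qlearning_dataset`; the helper name `subsample_transitions` and the fixed seed are ours, not from the paper.

```python
# Hypothetical helper: keep a random X% of the transitions in a D4RL-style
# dataset (a dict of arrays sharing the same leading axis).
import numpy as np

def subsample_transitions(dataset: dict, fraction: float, seed: int = 0) -> dict:
    n = len(dataset["rewards"])
    keep = max(1, int(n * fraction))
    idx = np.random.default_rng(seed).choice(n, size=keep, replace=False)
    return {key: arr[idx] for key, arr in dataset.items()}

# e.g. the paper's hardest setting, X = 1%:
# tiny = subsample_transitions(d4rl.qlearning_dataset(env), fraction=0.01)
```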
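Finally, the Experiment Setup row pins down the optimizer and the target-network rule. Here is a sketch of those settings in optax, assuming a JAX stack as the Software Dependencies row suggests; `update_target` is our name for the hard-copy rule that replaces the usual EMA.

```python
# Sketch of the reported hyperparameters: Adam(lr=3e-4, beta1=0.9,
# beta2=0.999) with discount gamma = 0.99.
import optax

GAMMA = 0.99
optimizer = optax.adam(learning_rate=3e-4, b1=0.9, b2=0.999)

# An EMA target blends parameters each step:
#   target <- tau * online + (1 - tau) * target
# The modification described in Section 3.1 is the tau = 1 limit, a hard copy:
def update_target(online_params, target_params):
    del target_params  # unused: the target is overwritten wholesale
    return online_params
```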