Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL

Authors: Yang Yue, Rui Lu, Bingyi Kang, Shiji Song, Gao Huang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experiments demonstrate perfect alignment with this theoretical analysis. Building on our insights, we propose to resolve divergence from a novel perspective, namely regularizing the neural network's generalization behavior. Through extensive empirical studies, we identify LayerNorm as a good solution to effectively avoid divergence without introducing detrimental bias, leading to superior performance. Experimental results prove that it can still work in the most challenging settings, i.e., using only 1% of the transitions in the dataset, where all previous methods fail. (A LayerNorm critic sketch follows the table.)
Researcher Affiliation | Collaboration | 1. Department of Automation, BNRist, Tsinghua University; 2. ByteDance Inc.
Pseudocode | No | The paper describes methods and equations (e.g., Equation 1 for Q-value iteration), but no section or block is explicitly labeled 'Pseudocode' or 'Algorithm', nor are steps presented in a code-like structured format.
Open Source Code | Yes | Code can be found at https://offrl-seem.github.io.
Open Datasets | Yes | Our experiments are conducted on a widely used offline RL benchmark, D4RL [10]... Previous state-of-the-art offline RL algorithms have performed exceptionally well on D4RL Mujoco Locomotion tasks... We construct transition-based datasets by randomly sampling varying proportions (X%) from the D4RL Mujoco Locomotion datasets. Here, we set several levels for X ∈ {1, 10, 50, 100}. (A subsampling sketch follows the table.)
Dataset Splits | No | The paper mentions using D4RL datasets (e.g., 'walker2d-medium-expert-v2', 'Antmaze') for training and evaluation. While D4RL datasets typically have predefined splits, the paper does not explicitly state the specific train/validation/test percentages or sample counts used for its experiments. The term 'validation' is used in the context of validating theoretical findings, not dataset splits.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU models, or memory specifications used for running its experiments.
Software Dependencies | No | The paper mentions using the Adam optimizer and building upon existing frameworks like TD3+BC and JAX implementations, but it does not provide specific version numbers for any programming languages, libraries, or software dependencies (e.g., Python version, PyTorch/TensorFlow version, CUDA version).
Experiment Setup | Yes | For the experiments presented in Section 3.1, we adopted TD3 as our baseline, but with a modification: instead of using an exponential moving average (EMA), we directly copied the current Q-network as the target network. The Adam optimizer was used with a learning rate of 0.0003, β1 = 0.9, and β2 = 0.999. The discount factor, γ, was set to 0.99. (A setup sketch follows the table.)
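
To make the LayerNorm remedy from the Research Type row concrete, here is a minimal sketch of a Q-network with LayerNorm inserted after each hidden layer. It assumes PyTorch; the hidden width, activation, and class name are illustrative choices for this sketch, not the authors' exact architecture (their code is at https://offrl-seem.github.io).

```python
import torch
import torch.nn as nn

class LayerNormQNetwork(nn.Module):
    """Illustrative MLP critic with LayerNorm after each hidden layer.

    Hidden width and activation are assumptions for this sketch,
    not the exact architecture from the paper.
    """
    def __init__(self, state_dim: int, action_dim: int, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),  # regularizes the network's generalization behavior
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # scalar Q-value
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))
```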
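The Open Datasets row describes building transition-level subsets by sampling X% of a D4RL dataset. The sketch below shows one way this could be done with the standard d4rl loader; uniform sampling without replacement and the `subsample_transitions` helper are assumptions about the procedure, not code from the paper.

```python
import gym
import numpy as np
import d4rl  # registers D4RL environments on import

def subsample_transitions(dataset: dict, fraction: float, seed: int = 0) -> dict:
    """Keep a random `fraction` of transitions (e.g. 0.01 for the 1% setting).

    Uniform sampling without replacement is an assumption for this sketch.
    """
    rng = np.random.default_rng(seed)
    n = len(dataset["rewards"])
    idx = rng.choice(n, size=max(1, int(n * fraction)), replace=False)
    return {k: v[idx] for k, v in dataset.items() if len(v) == n}

env = gym.make("walker2d-medium-expert-v2")
full = d4rl.qlearning_dataset(env)         # transition-level dict of NumPy arrays
small = subsample_transitions(full, 0.01)  # the hardest 1% setting
```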
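Finally, the Experiment Setup row pins down the optimizer and the target-network change. Below is a minimal sketch of that configuration, again assuming a PyTorch critic; the placeholder network dimensions and the `update_target` helper are hypothetical, with only the learning rate, betas, and discount factor taken from the row above.

```python
import copy
import torch
import torch.nn as nn

# Values quoted from the Experiment Setup row.
LR, BETAS, GAMMA = 3e-4, (0.9, 0.999), 0.99

critic = nn.Sequential(nn.Linear(23, 256), nn.ReLU(), nn.Linear(256, 1))  # placeholder critic
target_critic = copy.deepcopy(critic)
optimizer = torch.optim.Adam(critic.parameters(), lr=LR, betas=BETAS)

def update_target(tau: float | None = None) -> None:
    """Refresh the target network.

    Standard TD3 uses a Polyak/EMA update with a small tau; the paper's
    modification copies the online weights directly (equivalent to tau = 1).
    """
    if tau is None:  # the direct-copy variant described in the setup row
        target_critic.load_state_dict(critic.state_dict())
    else:  # conventional EMA update, shown for contrast
        with torch.no_grad():
            for p, tp in zip(critic.parameters(), target_critic.parameters()):
                tp.mul_(1.0 - tau).add_(tau * p)
```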