Supported Value Regularization for Offline Reinforcement Learning

Authors: Yixiu Mao, Hongchang Zhang, Chen Chen, Yi Xu, Xiangyang Ji

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we validate the theoretical properties of SVR in a tabular maze environment and demonstrate its state-of-the-art performance on a range of continuous control tasks in the D4RL benchmark. We aim to answer five questions: (1) Does SVR actually converge to the optimal support-constrained policy? (2) Does SVR perform better than previous methods on standard offline RL benchmarks? (3) When does SVR empirically benefit the most compared to the density-based regularization? (4) How should we select the sampling distribution of SVR in practice? (5) How does the implementation of each component affect SVR?
Researcher Affiliation | Academia | 1 Department of Automation, Tsinghua University; 2 School of Artificial Intelligence, Dalian University of Technology
Pseudocode | Yes | Algorithm 1: Supported Value Regularization (SVR)
Open Source Code | Yes | Our code is available at https://github.com/MAOYIXIU/SVR.
Open Datasets | Yes | Then we evaluate our approach on the D4RL benchmarks [7]. We use a simple maze environment to verify the support-constrained optimality of SVR. We first collect 10,000 transitions using a random policy.
Dataset Splits | No | The paper uses the D4RL benchmark and a self-collected maze dataset. While it describes how it constructs the maze dataset and the noisy D4RL datasets, it does not explicitly provide specific train/validation/test splits (e.g., percentages or counts) or refer to standard predefined splits within those datasets for reproducibility.
Hardware Specification | Yes | We test the runtime of SVR on halfcheetah-medium-replay-v2 on a GeForce RTX 3090. Table 4: Runtime of TD3BC, IQL, CQL, and SVR for halfcheetah-medium-replay-v2 on a GeForce RTX 3090.
Software Dependencies | No | The paper mentions "Optimizer Adam [17]" but does not provide specific version numbers for Adam or any other software libraries or frameworks used (e.g., Python, PyTorch/TensorFlow versions).
Experiment Setup | Yes | All hyperparameters of SVR are included in Table 2. Table 2: Hyperparameters in SVR: Optimizer Adam [17]; Critic learning rate 3e-4; Actor learning rate 3e-4 with cosine schedule; Batch size 256; Discount factor 0.99; Number of iterations 1e6; Target update rate τ 0.005; Policy update frequency 2; Number of critics 4; Penalty coefficient α {0.001, 0.02} for Gym-MuJoCo, {10} for Adroit; Standard deviation of u 0.2; Architecture: Actor input-256-256-output, Critic input-256-256-1. For evaluation, we average returns over 10 evaluation trajectories and 5 random seeds on all tasks.
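The Experiment Setup row lists the Table 2 hyperparameters as running prose. The sketch below arranges those values into a minimal PyTorch configuration, assuming the input-256-256-output MLP architecture quoted above; the module names, the example state/action dimensions, and the choice of α = 0.02 within the reported Gym-MuJoCo range are illustrative assumptions and not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of the SVR experiment configuration reported in Table 2.
# Hyperparameter values are taken from the quote above; everything else
# (class layout, dimensions, training loop) is an assumption for illustration.
import torch
import torch.nn as nn

HYPERPARAMS = {
    "critic_lr": 3e-4,
    "actor_lr": 3e-4,             # decayed with a cosine schedule per Table 2
    "batch_size": 256,
    "discount": 0.99,
    "num_iterations": int(1e6),
    "target_update_rate": 0.005,  # tau for soft target updates
    "policy_update_freq": 2,
    "num_critics": 4,
    "alpha": 0.02,                # penalty coefficient; {0.001, 0.02} for Gym-MuJoCo, 10 for Adroit
    "u_std": 0.2,                 # standard deviation of the sampling distribution u
}

def mlp(in_dim: int, out_dim: int) -> nn.Sequential:
    """Two 256-unit hidden layers, matching the input-256-256-output architecture."""
    return nn.Sequential(
        nn.Linear(in_dim, 256), nn.ReLU(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, out_dim),
    )

def soft_update(target: nn.Module, source: nn.Module, tau: float) -> None:
    """Polyak averaging of target-network parameters (tau = 0.005 in Table 2)."""
    with torch.no_grad():
        for tp, sp in zip(target.parameters(), source.parameters()):
            tp.mul_(1.0 - tau).add_(tau * sp)

# Example dimensions for a Gym-MuJoCo task (hypothetical; HalfCheetah-like).
state_dim, action_dim = 17, 6

actor = mlp(state_dim, action_dim)
critics = nn.ModuleList(
    [mlp(state_dim + action_dim, 1) for _ in range(HYPERPARAMS["num_critics"])]
)

actor_opt = torch.optim.Adam(actor.parameters(), lr=HYPERPARAMS["actor_lr"])
critic_opt = torch.optim.Adam(critics.parameters(), lr=HYPERPARAMS["critic_lr"])
# Cosine decay of the actor learning rate over the 1e6 training iterations.
actor_sched = torch.optim.lr_scheduler.CosineAnnealingLR(
    actor_opt, T_max=HYPERPARAMS["num_iterations"]
)
```

Per Table 2, only the actor learning rate follows the cosine schedule; the quoted evaluation protocol then averages returns over 10 evaluation trajectories and 5 random seeds per task.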