Supported Value Regularization for Offline Reinforcement Learning
Authors: Yixiu Mao, Hongchang Zhang, Chen Chen, Yi Xu, Xiangyang Ji
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we validate the theoretical properties of SVR in a tabular maze environment and demonstrate its state-of-the-art performance on a range of continuous control tasks in the D4RL benchmark. We aim to answer five questions: (1) Does SVR actually converge to the optimal support-constrained policy? (2) Does SVR perform better than previous methods on standard offline RL benchmarks? (3) When does SVR empirically benefit the most compared to the density-based regularization? (4) How should we select the sampling distribution of SVR in practice? (5) How does the implementation of each component affect SVR? |
| Researcher Affiliation | Academia | (1) Department of Automation, Tsinghua University; (2) School of Artificial Intelligence, Dalian University of Technology |
| Pseudocode | Yes | Algorithm 1 Supported Value Regularization (SVR) |
| Open Source Code | Yes | Our code is available at https://github.com/MAOYIXIU/SVR. |
| Open Datasets | Yes | Then we evaluate our approach on the D4RL benchmarks [7]. We use a simple maze environment to verify the support-constrained optimality of SVR. We first collect 10,000 transitions using a random policy. |
| Dataset Splits | No | The paper uses the D4RL benchmark and a self-collected maze dataset. While it describes how it constructs the maze dataset and noisy D4RL datasets, it does not explicitly provide specific train/validation/test splits (e.g., percentages or counts) or refer to standard predefined splits within those datasets for reproducibility. |
| Hardware Specification | Yes | We test the runtime of SVR on halfcheetah-medium-replay-v2 on a GeForce RTX 3090. Table 4: Runtime of TD3BC, IQL, CQL, SVR for halfcheetah-medium-replay-v2 on a GeForce RTX 3090. |
| Software Dependencies | No | The paper mentions "Optimizer Adam [17]" but does not provide specific version numbers for Adam or any other software libraries or frameworks used (e.g., Python, PyTorch/TensorFlow versions). |
| Experiment Setup | Yes | All hyperparameters of SVR are included in Table 2. Table 2: Hyperparameters in SVR: Optimizer Adam [17]; Critic learning rate 3e-4; Actor learning rate 3e-4 with cosine schedule; Batch size 256; Discount factor 0.99; Number of iterations 1e6; Target update rate τ 0.005; Policy update frequency 2; Number of critics 4; Penalty coefficient α {0.001, 0.02} for Gym-MuJoCo, {10} for Adroit; Standard deviation of u 0.2; Architecture: Actor input-256-256-output, Critic input-256-256-1. For evaluation, we average returns over 10 evaluation trajectories and 5 random seeds on all tasks. |
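
For convenience, the Table 2 settings quoted above can be gathered into a single configuration object. The sketch below is purely illustrative: the dataclass structure and field names are our assumptions, not the schema used in the authors' repository; only the values mirror the quoted hyperparameters, and the official code at https://github.com/MAOYIXIU/SVR should be treated as the reference.

```python
# Hedged sketch: field names are assumptions; values are copied from the
# quoted Table 2. This is not the authors' configuration code.
from dataclasses import dataclass


@dataclass
class SVRHyperparams:
    optimizer: str = "Adam"
    critic_lr: float = 3e-4
    actor_lr: float = 3e-4              # actor uses a cosine learning-rate schedule
    batch_size: int = 256
    discount: float = 0.99
    num_iterations: int = 1_000_000
    target_update_rate: float = 0.005   # tau for soft target updates
    policy_update_freq: int = 2
    num_critics: int = 4
    penalty_alpha: float = 0.001        # {0.001, 0.02} for Gym-MuJoCo, 10 for Adroit
    u_std: float = 0.2                  # standard deviation of u
    actor_hidden: tuple = (256, 256)    # actor: input-256-256-output
    critic_hidden: tuple = (256, 256)   # critic: input-256-256-1


config = SVRHyperparams()  # e.g., override penalty_alpha=10 for Adroit tasks
```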