Supported Policy Optimization for Offline Reinforcement Learning
Authors: Jialong Wu, Haixu Wu, Zihan Qiu, Jianmin Wang, Mingsheng Long
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments aim to evaluate our method comparatively, in contrast to prior offline RL methods, focusing on both offline training and online fine-tuning. We first demonstrate the effect of λ on applying support constraint and show that our method is able to learn a policy with the strongest performance at the same level of constraint strength, compared to previous policy constraint methods. We then evaluate SPOT on D4RL benchmark [6], studying how effective our method is in contrast to a broader range of state-of-the-art offline RL methods. |
| Researcher Affiliation | Academia | Jialong Wu¹, Haixu Wu¹, Zihan Qiu², Jianmin Wang¹, Mingsheng Long¹; ¹School of Software, BNRist, Tsinghua University, China; ²Institute for Interdisciplinary Information Sciences, Tsinghua University, China |
| Pseudocode | Yes | Algorithm 1 Supported Policy Optimization (SPOT); a hedged sketch of the corresponding actor update appears below the table. |
| Open Source Code | Yes | Code is available at https://github.com/thuml/SPOT. |
| Open Datasets | Yes | We focus on Gym-MuJoCo locomotion domains and the much more challenging AntMaze domains, which consist of sparse-reward tasks and require stitching fragments of suboptimal trajectories traveling undirectedly in order to find a path from the start to the goal of the maze. ... D4RL benchmark [6] (a dataset-loading sketch appears below the table). |
| Dataset Splits | No | The paper mentions tuning hyperparameters but does not explicitly provide specific train/validation/test dataset splits (e.g., percentages or counts) within its main text or appendix for reproduction purposes beyond referring to standard benchmarks. |
| Hardware Specification | No | The paper discusses computation cost and runtime but does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for the experiments. |
| Software Dependencies | No | The paper states 'We implement our algorithm in PyTorch' but does not provide specific version numbers for PyTorch or any other software dependencies like Python or CUDA. |
| Experiment Setup | Yes | Appendix C provides detailed experimental setup information, including hyperparameters (e.g., learning rate 3e-4 for actors/critics, 1e-3 for VAE, batch size 256), network architectures (e.g., two-layer MLP for policy, Q-function, and VAE components with 256 units per layer), and training procedures (a configuration sketch based on these values follows the table). |
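The Pseudocode row notes that the paper gives Algorithm 1 for SPOT. As a companion, here is a minimal, hedged sketch of what a SPOT-style actor update looks like in PyTorch: the actor maximizes the critic's Q-value while paying a λ-weighted density penalty, with the behavior density approximated by a pretrained conditional VAE's negative ELBO. The `actor`, `critic`, and `vae` interfaces (the VAE returning reconstruction, mean, and std) are assumptions made for illustration, not the authors' implementation.

```python
# Hedged sketch of a SPOT-style actor update: maximize Q while penalizing
# actions whose estimated behavior density (negative ELBO from a pretrained
# conditional VAE) is low. `actor`, `critic`, `vae` are assumed interfaces.
import torch

def vae_negative_elbo(vae, state, action, num_samples=1):
    """Monte-Carlo estimate of an upper bound on -log p_beta(a|s) via the VAE's negative ELBO."""
    losses = []
    for _ in range(num_samples):
        recon, mean, std = vae(state, action)        # assumed VAE forward signature
        recon_loss = ((recon - action) ** 2).sum(-1)
        kl = -0.5 * (1 + torch.log(std ** 2) - mean ** 2 - std ** 2).sum(-1)
        losses.append(recon_loss + kl)               # negative ELBO >= -log p(a|s)
    return torch.stack(losses).mean(0)

def actor_loss(actor, critic, vae, state, lam=0.1):  # lam is task-dependent in the paper
    action = actor(state)
    q = critic(state, action)
    density_penalty = vae_negative_elbo(vae, state, action)
    # Normalizing Q (in the spirit of TD3+BC) keeps lam on a comparable scale across tasks.
    return (-q / q.abs().mean().detach() + lam * density_penalty).mean()
```

For the D4RL datasets referenced in the Open Datasets row, a minimal loading sketch is shown below. It assumes the `d4rl` package (which registers the benchmark environments with Gym) and uses the standard `d4rl.qlearning_dataset` helper; the task name `antmaze-medium-play-v2` is one example from the benchmark, not a statement of exactly which dataset versions the paper used.

```python
# Minimal sketch of loading a D4RL dataset for offline training.
import gym
import d4rl  # registers D4RL environments with Gym on import

env = gym.make("antmaze-medium-play-v2")   # or e.g. "halfcheetah-medium-v2"
dataset = d4rl.qlearning_dataset(env)      # dict of numpy transition arrays

observations = dataset["observations"]
actions = dataset["actions"]
rewards = dataset["rewards"]
next_observations = dataset["next_observations"]
terminals = dataset["terminals"]
```

The hyperparameters and architectures quoted in the Experiment Setup row translate roughly into the configuration sketch below. The two-layer 256-unit MLPs, learning rates, and batch size come from that row; the module names, optimizer wiring, and example state/action dimensions are assumptions for illustration.

```python
# Configuration sketch matching the reported setup: two-layer 256-unit MLPs,
# lr 3e-4 for actor/critic, lr 1e-3 for the VAE, batch size 256.
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

state_dim, action_dim = 17, 6              # e.g. HalfCheetah; task-dependent
policy = mlp(state_dim, action_dim)
q_function = mlp(state_dim + action_dim, 1)

policy_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
q_opt = torch.optim.Adam(q_function.parameters(), lr=3e-4)
# The VAE encoder/decoder (also two-layer 256-unit MLPs) would use lr=1e-3.
batch_size = 256
```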