Supported Policy Optimization for Offline Reinforcement Learning

Authors: Jialong Wu, Haixu Wu, Zihan Qiu, Jianmin Wang, Mingsheng Long

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments aim to evaluate our method comparatively, in contrast to prior offline RL methods, focusing on both offline training and online fine-tuning. We first demonstrate the effect of λ on applying support constraint and show that our method is able to learn a policy with the strongest performance at the same level of constraint strength, compared to previous policy constraint methods. We then evaluate SPOT on D4RL benchmark [6], studying how effective our method is in contrast to a broader range of state-of-the-art offline RL methods.
Researcher Affiliation | Academia | Jialong Wu¹, Haixu Wu¹, Zihan Qiu², Jianmin Wang¹, Mingsheng Long¹ (¹School of Software, BNRist, Tsinghua University, China; ²Institute for Interdisciplinary Information Sciences, Tsinghua University, China)
Pseudocode | Yes | Algorithm 1 Supported Policy Optimization (SPOT). An illustrative PyTorch sketch of the corresponding actor update appears after this table.
Open Source Code | Yes | Code is available at https://github.com/thuml/SPOT.
Open Datasets | Yes | We focus on Gym-MuJoCo locomotion domains and the much more challenging AntMaze domains, which consist of sparse-reward tasks and require stitching fragments of suboptimal trajectories traveling undirectedly in order to find a path from the start to the goal of the maze. ... D4RL benchmark [6]. A dataset-loading sketch appears after this table.
Dataset Splits | No | The paper mentions tuning hyperparameters but does not provide explicit train/validation/test splits (e.g., percentages or counts) in the main text or appendix; it only refers to the standard benchmark datasets.
Hardware Specification | No | The paper discusses computation cost and runtime but does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for the experiments.
Software Dependencies | No | The paper states 'We implement our algorithm in PyTorch' but does not provide specific version numbers for PyTorch or any other software dependencies such as Python or CUDA.
Experiment Setup | Yes | Appendix C provides detailed experimental setup information, including hyperparameters (e.g., learning rate 3e-4 for actors/critics, 1e-3 for VAE, batch size 256), network architectures (e.g., two-layer MLP for policy, Q-function, and VAE components with 256 units per layer), and training procedures. A configuration sketch appears after this table.
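
For readers checking the Pseudocode item: Algorithm 1 in the paper alternates pretraining a conditional VAE on the offline data with TD3-style actor-critic updates under a density-based support constraint weighted by λ. The sketch below is a minimal PyTorch illustration of that actor update only; the `actor`, `critic`, and `vae` modules, their forward signatures, the single-latent-sample ELBO estimate, and the Q-normalization detail are assumptions for illustration rather than the authors' exact implementation (see the released code for that).

```python
import torch


def vae_neg_log_density(vae, state, action, num_samples=1):
    """Estimate -log beta(a|s) with the negative CVAE ELBO (a lower bound on
    the behavior log-density). Assumes `vae(state, action)` returns the
    reconstructed action and the Gaussian posterior mean/std."""
    elbos = []
    for _ in range(num_samples):
        recon, mean, std = vae(state, action)
        recon_loss = ((recon - action) ** 2).sum(-1)
        kl = -0.5 * (1 + torch.log(std.pow(2)) - mean.pow(2) - std.pow(2)).sum(-1)
        elbos.append(-(recon_loss + kl))   # ELBO <= log beta(a|s)
    return -torch.stack(elbos).mean(0)     # approx. -log beta(a|s)


def spot_actor_loss(actor, critic, vae, state, lam):
    """Maximize Q while penalizing actions with low estimated behavior density."""
    pi = actor(state)
    q = critic(state, pi)
    # Normalizing by the Q magnitude (as in TD3+BC-style objectives) keeps the
    # trade-off with lambda comparable across tasks; this detail is assumed here.
    q_term = -q.mean() / q.abs().mean().detach()
    constraint = vae_neg_log_density(vae, state, pi).mean()
    return q_term + lam * constraint
```

In a training loop, this loss would be minimized with Adam for the actor while the critic is updated with standard TD targets.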
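The Open Datasets item points to the public D4RL benchmark. Below is a minimal sketch of how such datasets are typically loaded with the `d4rl` package; the task names and dataset versions are examples and may not match the exact versions evaluated in the paper.

```python
import gym
import d4rl  # importing d4rl registers the offline-RL environments with gym

# Example task names from the Gym-MuJoCo and AntMaze domains (versions are
# illustrative, not necessarily those used in the paper's experiments).
for task in ["halfcheetah-medium-replay-v2", "antmaze-medium-play-v0"]:
    env = gym.make(task)
    data = d4rl.qlearning_dataset(env)  # observations, actions, rewards, ...
    print(task, data["observations"].shape, data["actions"].shape)
```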
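Finally, the Experiment Setup item lists the main hyperparameters from Appendix C. The snippet below sketches what that configuration could look like in PyTorch; the state/action/latent dimensions and the exact VAE factorization are placeholders, and only the layer widths, learning rates, and batch size come from the reported values.

```python
import torch
import torch.nn as nn

# Placeholder dimensions for illustration only.
state_dim, action_dim, latent_dim = 17, 6, 12


def mlp(in_dim, out_dim, hidden=256):
    """Two hidden layers of 256 units, matching the reported architecture."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


actor = mlp(state_dim, action_dim)                        # policy network
critic = mlp(state_dim + action_dim, 1)                   # Q-function
vae_encoder = mlp(state_dim + action_dim, 2 * latent_dim) # posterior mean/log-std
vae_decoder = mlp(state_dim + latent_dim, action_dim)     # action reconstruction

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=3e-4)
vae_opt = torch.optim.Adam(
    list(vae_encoder.parameters()) + list(vae_decoder.parameters()), lr=1e-3
)
batch_size = 256
```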