Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning

Authors: Shentao Yang, Yihao Feng, Shujian Zhang, Mingyuan Zhou

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | On a wide range of continuous-control offline RL datasets, our method indicates competitive performance, which validates our algorithm. |
| Researcher Affiliation | Academia | McCombs School of Business, Department of Computer Science, Department of Statistics & Data Science, The University of Texas at Austin. |
| Pseudocode | Yes | Algorithm 1: SDM-GAN, Main Steps |
| Open Source Code | Yes | The code is publicly available. |
| Open Datasets | Yes | On a wide range of continuous-control offline RL datasets from the D4RL benchmark (Fu et al., 2020), our method indicates competitive performance, which validates our algorithmic designs. (See the D4RL loading sketch after the table.) |
| Dataset Splits | No | The paper discusses mixing the offline dataset D_env with model-generated rollouts D_model using a ratio f, i.e., D := f · D_env + (1 - f) · D_model (see the data-mixing sketch after the table). While this affects the data used for training, it does not describe a traditional train/validation/test split for evaluating model generalization or tuning hyperparameters on a distinct validation set. The authors state 'We rollout our agent and the baselines for 10 episodes after each epoch of training', which is online evaluation rather than validation on a dataset split. |
| Hardware Specification | No | The paper mentions support from 'the Texas Advanced Computing Center (TACC) for providing HPC resources', but this is a general statement and does not specify particular CPU or GPU models, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions 'Optimizer Adam (Kingma & Ba, 2014)' in Table 6 but does not provide specific version numbers for Adam or any other software libraries, frameworks, or languages used. |
| Experiment Setup | Yes | Table 6 lists the hyperparameters shared across all datasets in the empirical study, including specific values for learning rates, batch size, discount factor, target network update rate, noise distribution, etc. It additionally states: 'Specifically, we tune the dimension of the noise distribution pz(z) for controlling the stochasticity of the learned policy, and the rollout horizon h for mitigating model estimation error.' and provides concrete values for these tuned parameters for different tasks (see the configuration sketch after the table). |
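
As a concrete illustration of the Open Datasets row, the sketch below loads one D4RL continuous-control dataset. It assumes the standard `d4rl` Python package rather than any code released by the authors, and the task name `halfcheetah-medium-v2` is an illustrative choice, not a claim about which datasets the paper uses.

```python
# Illustrative only: loading a D4RL continuous-control dataset (Fu et al., 2020).
# The task name is an assumption; the paper evaluates on a range of D4RL tasks.
import gym
import d4rl  # importing d4rl registers the offline-RL environments with gym

env = gym.make("halfcheetah-medium-v2")   # hypothetical example task
dataset = d4rl.qlearning_dataset(env)     # dict of aligned numpy arrays

for key in ("observations", "actions", "rewards", "next_observations", "terminals"):
    print(key, dataset[key].shape)
```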
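
The Dataset Splits row refers to the mixture D := f · D_env + (1 - f) · D_model of offline data and model rollouts. The sketch below is a minimal illustration of such ratio-based minibatch sampling under assumed dictionary-of-arrays buffers; the helper name `sample_mixed_batch` is hypothetical and this is not the authors' implementation.

```python
import numpy as np

def sample_mixed_batch(d_env, d_model, batch_size, f, rng=np.random.default_rng()):
    """Draw a minibatch with a fraction f of transitions from the offline dataset
    d_env and the remaining 1 - f from model-generated rollouts d_model.
    Both arguments are dicts of aligned numpy arrays (illustrative layout)."""
    n_env = int(round(f * batch_size))
    n_model = batch_size - n_env
    idx_env = rng.integers(0, len(d_env["observations"]), size=n_env)
    idx_model = rng.integers(0, len(d_model["observations"]), size=n_model)
    return {
        key: np.concatenate([d_env[key][idx_env], d_model[key][idx_model]], axis=0)
        for key in d_env
    }
```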
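
For the Experiment Setup row, the dictionary below sketches the shared-versus-tuned configuration structure that Table 6 and the quoted sentence describe. Every numeric value and the task name are placeholder assumptions, not settings reported in the paper; only the optimizer choice and the names of the tuned quantities come from the text above.

```python
# Placeholder sketch of the configuration structure described for Table 6.
# All numeric values are assumptions, NOT the paper's reported hyperparameters.
shared_hyperparameters = {
    "optimizer": "Adam",                  # Kingma & Ba, 2014 (stated in Table 6)
    "learning_rate": 3e-4,                # placeholder
    "batch_size": 256,                    # placeholder
    "discount_factor": 0.99,              # placeholder
    "target_network_update_rate": 5e-3,   # placeholder soft-update coefficient
    "noise_distribution": "p_z(z)",       # noise fed to the stochastic policy
}

# Quantities the authors say they tune per dataset.
per_task_hyperparameters = {
    "hypothetical-task-v2": {
        "noise_dim": 10,                  # dimension of p_z(z); placeholder
        "rollout_horizon_h": 5,           # model rollout horizon; placeholder
    },
}
```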