Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning
Authors: Shentao Yang, Yihao Feng, Shujian Zhang, Mingyuan Zhou
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On a wide range of continuous-control offline RL datasets, our method shows competitive performance, which validates our algorithm. |
| Researcher Affiliation | Academia | McCombs School of Business, Department of Computer Science, Department of Statistics & Data Science, The University of Texas at Austin. |
| Pseudocode | Yes | Algorithm 1 SDM-GAN, Main Steps |
| Open Source Code | Yes | The code is publicly available. |
| Open Datasets | Yes | On a wide range of continuous-control offline RL datasets from the D4RL benchmark (Fu et al., 2020), our method shows competitive performance, which validates our algorithmic designs. |
| Dataset Splits | No | The paper discusses mixing 'Denv' (the offline dataset) with 'Dmodel' (model-generated rollouts) using a ratio 'f', i.e., D := f·Denv + (1 − f)·Dmodel (see the sketch after the table). While this affects the data used for training, it does not describe a traditional train/validation/test split for evaluating model generalization or tuning hyperparameters on a distinct validation set. The authors state 'We rollout our agent and the baselines for 10 episodes after each epoch of training' for evaluation, which is online evaluation, not validation on a dataset split. |
| Hardware Specification | No | The paper mentions support from 'the Texas Advanced Computing Center (TACC) for providing HPC resources', but this is a general statement and does not specify particular CPU or GPU models, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions 'Optimizer Adam (Kingma & Ba, 2014)' in Table 6 but does not provide specific version numbers for Adam or any other software libraries, frameworks, or languages used. |
| Experiment Setup | Yes | Table 6 shows the hyperparameters shared across all datasets for our empirical study. It lists specific values for learning rates, batch size, discount factor, target network update rate, noise distribution, etc. Additionally, it states: 'Specifically, we tune the dimension of the noise distribution pz(z) for controlling the stochasticity of the learned policy, and the rollout horizon h for mitigating model estimation error.' and provides concrete values for these tuned parameters for different tasks. |
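
To make the data-mixing step referenced in the Dataset Splits row concrete, below is a minimal Python sketch of drawing a mini-batch with a fraction f of samples from the offline dataset Denv and the remainder from model rollouts Dmodel. The function name `sample_mixed_batch`, the list-based buffer layout, and the specific values (batch size, f = 0.5) are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Hypothetical sketch of the mixed mini-batch D := f * D_env + (1 - f) * D_model.
# Buffer layout, names, and values are placeholders, not the paper's code.

def sample_mixed_batch(env_buffer, model_buffer, batch_size=512, f=0.5, rng=None):
    """Draw ~f of the mini-batch from the offline dataset and the rest
    from model-generated rollouts, then shuffle the combined batch."""
    rng = np.random.default_rng() if rng is None else rng
    n_env = int(round(f * batch_size))      # samples taken from D_env
    n_model = batch_size - n_env            # samples taken from D_model
    env_idx = rng.integers(len(env_buffer), size=n_env)
    model_idx = rng.integers(len(model_buffer), size=n_model)
    batch = [env_buffer[i] for i in env_idx] + [model_buffer[j] for j in model_idx]
    order = rng.permutation(len(batch))     # interleave the two sources
    return [batch[k] for k in order]

# Toy usage: buffers hold (state, action, reward, next_state) tuples.
env_buffer = [(np.zeros(3), np.zeros(1), 0.0, np.zeros(3)) for _ in range(1000)]
model_buffer = [(np.ones(3), np.ones(1), 1.0, np.ones(3)) for _ in range(1000)]
mini_batch = sample_mixed_batch(env_buffer, model_buffer, batch_size=256, f=0.5)
```

In the paper, the mixing ratio f appears alongside tuned quantities such as the rollout horizon h and the noise-distribution dimension; this sketch only illustrates the sampling ratio, not how the model rollouts in Dmodel are generated.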