Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning

Authors: Shentao Yang, Yihao Feng, Shujian Zhang, Mingyuan Zhou

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | On a wide range of continuous-control offline RL datasets, our method indicates competitive performance, which validates our algorithm. |
| Researcher Affiliation | Academia | McCombs School of Business, Department of Computer Science, Department of Statistics & Data Science, The University of Texas at Austin. |
| Pseudocode | Yes | Algorithm 1: SDM-GAN, Main Steps |
| Open Source Code | Yes | The code is publicly available. |
| Open Datasets | Yes | On a wide range of continuous-control offline RL datasets from the D4RL benchmark (Fu et al., 2020), our method indicates competitive performance, which validates our algorithmic designs. (See the D4RL loading sketch after the table.) |
| Dataset Splits | No | The paper discusses mixing the offline dataset D_env with model-generated rollouts D_model using a ratio f, i.e., D := f · D_env + (1 - f) · D_model (see the data-mixing sketch after the table). While this affects the data used for training, it does not describe a traditional train/validation/test split for evaluating model generalization or tuning hyperparameters on a distinct validation set. The authors state 'We rollout our agent and the baselines for 10 episodes after each epoch of training', which is online evaluation rather than validation on a dataset split. |
| Hardware Specification | No | The paper mentions support from 'the Texas Advanced Computing Center (TACC) for providing HPC resources', but this is a general statement and does not specify particular CPU or GPU models, or other detailed hardware specifications. |
| Software Dependencies | No | The paper mentions 'Optimizer Adam (Kingma & Ba, 2014)' in Table 6 but does not provide specific version numbers for Adam or any other software libraries, frameworks, or languages used. |
| Experiment Setup | Yes | Table 6 lists the hyperparameters shared across all datasets in the empirical study, including specific values for learning rates, batch size, discount factor, target network update rate, noise distribution, etc. It additionally states: 'Specifically, we tune the dimension of the noise distribution pz(z) for controlling the stochasticity of the learned policy, and the rollout horizon h for mitigating model estimation error.' and provides concrete values for these tuned parameters for different tasks (see the configuration sketch after the table). |
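
As a concrete illustration of the Open Datasets row, the sketch below loads one D4RL continuous-control dataset. It assumes the standard `d4rl` Python package rather than any code released by the authors, and the task name `halfcheetah-medium-v2` is an illustrative choice, not a claim about which datasets the paper uses.

```python
# Illustrative only: loading a D4RL continuous-control dataset (Fu et al., 2020).
# The task name is an assumption; the paper evaluates on a range of D4RL tasks.
import gym
import d4rl  # importing d4rl registers the offline-RL environments with gym

env = gym.make("halfcheetah-medium-v2")   # hypothetical example task
dataset = d4rl.qlearning_dataset(env)     # dict of aligned numpy arrays

for key in ("observations", "actions", "rewards", "next_observations", "terminals"):
    print(key, dataset[key].shape)
```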
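
The Dataset Splits row refers to the mixture D := f · D_env + (1 - f) · D_model of offline data and model rollouts. The sketch below is a minimal illustration of such ratio-based minibatch sampling under assumed dictionary-of-arrays buffers; the helper name `sample_mixed_batch` is hypothetical and this is not the authors' implementation.

```python
import numpy as np

def sample_mixed_batch(d_env, d_model, batch_size, f, rng=np.random.default_rng()):
    """Draw a minibatch with a fraction f of transitions from the offline dataset
    d_env and the remaining 1 - f from model-generated rollouts d_model.
    Both arguments are dicts of aligned numpy arrays (illustrative layout)."""
    n_env = int(round(f * batch_size))
    n_model = batch_size - n_env
    idx_env = rng.integers(0, len(d_env["observations"]), size=n_env)
    idx_model = rng.integers(0, len(d_model["observations"]), size=n_model)
    return {
        key: np.concatenate([d_env[key][idx_env], d_model[key][idx_model]], axis=0)
        for key in d_env
    }
```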
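
For the Experiment Setup row, the dictionary below sketches the shared-versus-tuned configuration structure that Table 6 and the quoted sentence describe. Every numeric value and the task name are placeholder assumptions, not settings reported in the paper; only the optimizer choice and the names of the tuned quantities come from the text above.

```python
# Placeholder sketch of the configuration structure described for Table 6.
# All numeric values are assumptions, NOT the paper's reported hyperparameters.
shared_hyperparameters = {
    "optimizer": "Adam",                  # Kingma & Ba, 2014 (stated in Table 6)
    "learning_rate": 3e-4,                # placeholder
    "batch_size": 256,                    # placeholder
    "discount_factor": 0.99,              # placeholder
    "target_network_update_rate": 5e-3,   # placeholder soft-update coefficient
    "noise_distribution": "p_z(z)",       # noise fed to the stochastic policy
}

# Quantities the authors say they tune per dataset.
per_task_hyperparameters = {
    "hypothetical-task-v2": {
        "noise_dim": 10,                  # dimension of p_z(z); placeholder
        "rollout_horizon_h": 5,           # model rollout horizon; placeholder
    },
}
```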