State Deviation Correction for Offline Reinforcement Learning

Authors: Hongchang Zhang, Jianzhun Shao, Yuhang Jiang, Shuncheng He, Guanwen Zhang, Xiangyang Ji

AAAI 2022, pp. 9022-9030

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results demonstrate that our proposed method is competitive with the state-of-the-art methods in a Grid World setup, offline Mujoco control suite, and a modified offline Mujoco dataset with a finite number of valuable samples.
Researcher Affiliation | Academia | Hongchang Zhang (1), Jianzhun Shao (1), Yuhang Jiang (1), Shuncheng He (1), Guanwen Zhang (2), Xiangyang Ji (1)*; (1) Tsinghua University, (2) Northwestern Polytechnical University; hc-zhang19@mails.tsinghua.edu.cn
Pseudocode | Yes | Algorithm 1: State Deviation Correction
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | For the dataset, we use a Grid World setting and the Mujoco datasets in the D4RL benchmarks (Fu et al. 2020). Fu, J.; Kumar, A.; Nachum, O.; Tucker, G.; and Levine, S. 2020. D4RL: Datasets for Deep Data-Driven Reinforcement Learning. arXiv preprint arXiv:2004.07219. (See the dataset-loading sketch after this table.)
Dataset Splits | No | The paper uses D4RL datasets but does not explicitly provide specific details about the training, validation, and test splits (e.g., percentages, sample counts, or explicit references to predefined splits used for reproduction).
Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, memory, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions various algorithms and models (e.g., soft actor-critic, CVAE) and tools (t-SNE), but it does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, scikit-learn versions) required to replicate the experiment.
Experiment Setup | Yes | We choose η = 0.05 in our experiment. In our implementation, we use Gaussian kernels and set n = m = 4. SDC first adds noise ϵ with small magnitude to the state and formulates a noisy state as ŝ = s + βϵ, where ϵ is sampled from a Gaussian distribution N(0, 1) and β is a small constant. For each task, we train a CQL agent, a BEAR agent, and an SDC agent for 1,000,000 updates.
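The Experiment Setup row above gives two concrete pieces: the noisy-state formulation ŝ = s + βϵ with ϵ ~ N(0, 1), and Gaussian kernels with n = m = 4 samples per side. Below is a minimal sketch of how these could look in code, assuming PyTorch, an illustrative bandwidth sigma and β, and that the Gaussian kernels are used for an MMD-style distance between sampled next states; how such a term enters the SDC training objective is not described in the rows above, so the surrounding training loop is omitted.

```python
import torch


def perturb_state(state: torch.Tensor, beta: float = 0.01) -> torch.Tensor:
    """Form a noisy state s_hat = s + beta * eps with eps ~ N(0, 1) (beta is assumed)."""
    eps = torch.randn_like(state)
    return state + beta * eps


def gaussian_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Pairwise Gaussian (RBF) kernel: x is (B, n, D), y is (B, m, D) -> (B, n, m)."""
    diff = x.unsqueeze(2) - y.unsqueeze(1)      # (B, n, m, D)
    sq_dist = diff.pow(2).sum(dim=-1)           # (B, n, m)
    return torch.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_gaussian(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Biased MMD^2 estimate between sample sets x (B, n, D) and y (B, m, D)."""
    k_xx = gaussian_kernel(x, x, sigma).mean(dim=(1, 2))
    k_yy = gaussian_kernel(y, y, sigma).mean(dim=(1, 2))
    k_xy = gaussian_kernel(x, y, sigma).mean(dim=(1, 2))
    return k_xx + k_yy - 2.0 * k_xy             # (B,)


if __name__ == "__main__":
    # n = m = 4 samples per distribution, as stated in the setup row;
    # batch size and state dimension are placeholders.
    batch, n_samples, state_dim = 32, 4, 17
    s = torch.randn(batch, state_dim)
    s_hat = perturb_state(s, beta=0.01)                       # noisy states
    pred_next = torch.randn(batch, n_samples, state_dim)      # placeholder next-state samples
    data_next = torch.randn(batch, n_samples, state_dim)      # placeholder next-state samples
    penalty = mmd_gaussian(pred_next, data_next)              # (32,)
```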
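For the Open Datasets row, which cites the D4RL Mujoco benchmarks (Fu et al. 2020), the following is a minimal loading sketch assuming the standard d4rl Python package and its Gym registration; the specific task name is illustrative and not taken from the paper.

```python
import gym
import d4rl  # registers the D4RL environments with Gym

# Hypothetical task choice; any D4RL Mujoco dataset id works the same way.
env = gym.make("halfcheetah-medium-v0")
dataset = d4rl.qlearning_dataset(env)  # dict of offline transitions

observations = dataset["observations"]            # (N, obs_dim)
actions = dataset["actions"]                      # (N, act_dim)
rewards = dataset["rewards"]                      # (N,)
next_observations = dataset["next_observations"]  # (N, obs_dim)
terminals = dataset["terminals"]                  # (N,)
```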