State Regularized Policy Optimization on Data with Dynamics Shift

Authors: Zhenghai Xue, Qingpeng Cai, Shuchang Liu, Dong Zheng, Peng Jiang, Kun Gai, Bo An

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that SRPO can make several context-based algorithms far more data efficient and significantly improve their overall performance. In this section, we conduct experiments to investigate the following questions: (1) Can SRPO leverage data with distribution shift and outperform current SOTA algorithms in the setting of HiP-MDP, in both online and offline RL?
Researcher Affiliation | Collaboration | 1 Nanyang Technological University, Singapore; 2 Kuaishou Technology; 3 Unaffiliated
Pseudocode | Yes | Algorithm 1: The workflow of SRPO on top of MAPLE [12].
Open Source Code | No | The paper does not provide an explicit statement or a link indicating that the source code for the described methodology is open source or publicly available.
Open Datasets | Yes | Then a set of states is sampled from the D4RL [42] dataset and classified into two sets according to the output of Dδ. Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. CoRR, abs/2004.07219, 2020. (A minimal sketch of this sampling-and-classification step follows the table.)
Dataset Splits | No | The paper mentions using the D4RL dataset but does not provide specific details on how it was split into training, validation, and test sets for its experiments, nor does it cite a standard split.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions using the MuJoCo simulator and various RL algorithms (PPO, CaDM, MAPLE, CQL) but does not provide specific version numbers for any software dependencies or libraries.
Experiment Setup | Yes | We alter the simulator gravity to generate different dynamics in online experiments. Possible values of gravity are {1.0}, {0.7, 1.0, 1.3}, and {0.4, 0.7, 1.0, 1.3, 1.6} in experiments with 1, 3, and 5 kinds of different dynamics, respectively. We set ρ = 0.5 in offline experiments with medium-expert level of data. ρ = 0.2 is set in all other experiments. λ is regarded as a hyperparameter with values 0.1 or 0.3. (A hedged sketch of the gravity variation follows the table.)
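
The "Open Datasets" row quotes a step in which states sampled from D4RL are split into two sets by a discriminator Dδ. Below is a minimal sketch of that step in Python, assuming the standard gym/d4rl dataset API; the discriminator `D_delta` (an untrained MLP here), the environment id, and the 0.5 decision threshold are illustrative stand-ins, not the paper's actual model or rule.

```python
# Minimal sketch (not the authors' code): sample states from a D4RL dataset
# and classify them into two sets with a state discriminator D_delta.
# `D_delta` and the 0.5 threshold are illustrative assumptions.
import gym
import d4rl  # noqa: F401  (importing d4rl registers its environments with gym)
import numpy as np
import torch

env = gym.make("halfcheetah-medium-v2")   # any D4RL task id works here
dataset = env.get_dataset()               # standard D4RL dict-of-arrays API
states = dataset["observations"]

# Sample a batch of states from the offline dataset.
idx = np.random.choice(len(states), size=256, replace=False)
batch = torch.as_tensor(states[idx], dtype=torch.float32)

# Stand-in for the paper's discriminator D_delta: a binary classifier over
# states (here an untrained MLP with a sigmoid output, for illustration only).
D_delta = torch.nn.Sequential(
    torch.nn.Linear(batch.shape[1], 256), torch.nn.ReLU(),
    torch.nn.Linear(256, 1), torch.nn.Sigmoid(),
)

with torch.no_grad():
    scores = D_delta(batch).squeeze(-1)

# Classify the sampled states into two sets according to D_delta's output.
set_high = batch[scores >= 0.5]
set_low = batch[scores < 0.5]
print(f"{len(set_high)} states above threshold, {len(set_low)} below")
```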
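
The "Experiment Setup" row lists gravity value sets used to generate 1, 3, or 5 kinds of dynamics. The sketch below shows one common way to realize this with gym MuJoCo environments; treating the listed numbers as multipliers of the default gravity and editing `model.opt.gravity` in place are assumptions on our part, and the ρ and λ hyperparameters quoted above are not modeled here.

```python
# Minimal sketch (an assumption, not the authors' code): building MuJoCo
# environments whose gravity is rescaled to create 1, 3, or 5 kinds of
# dynamics, mirroring the value sets quoted above. The numbers are treated
# as multipliers of the default gravity; the paper's exact mechanism for
# altering the simulator may differ.
import gym
import numpy as np

GRAVITY_SETS = {
    1: [1.0],
    3: [0.7, 1.0, 1.3],
    5: [0.4, 0.7, 1.0, 1.3, 1.6],
}

def make_env_with_gravity(env_id: str, scale: float) -> gym.Env:
    """Create a MuJoCo env and scale its gravity vector by `scale`."""
    env = gym.make(env_id)
    gravity = env.unwrapped.model.opt.gravity   # e.g. array([0., 0., -9.81])
    gravity[:] = np.asarray(gravity) * scale    # in-place edit of the model option
    return env

# Example: the 3-dynamics online setting on HalfCheetah (env id is illustrative).
envs = [make_env_with_gravity("HalfCheetah-v3", s) for s in GRAVITY_SETS[3]]
```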