State Regularized Policy Optimization on Data with Dynamics Shift
Authors: Zhenghai Xue, Qingpeng Cai, Shuchang Liu, Dong Zheng, Peng Jiang, Kun Gai, Bo An
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that SRPO can make several context-based algorithms far more data efficient and significantly improve their overall performance. In this section, we conduct experiments to investigate the following questions: (1) Can SRPO leverage data with distribution shift and outperform current SOTA algorithms in the setting of HiP-MDP, in both online and offline RL? |
| Researcher Affiliation | Collaboration | 1Nanyang Technological University, Singapore 2Kuaishou Technology 3Unaffiliated |
| Pseudocode | Yes | Algorithm 1 The workflow of SRPO on top of MAPLE [12]. |
| Open Source Code | No | The paper does not provide an explicit statement or a link indicating that the source code for the described methodology is open-source or publicly available. |
| Open Datasets | Yes | Then a set of states is sampled from the D4RL [42] dataset and classified into two sets according to the output of Dδ. Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: datasets for deep data-driven reinforcement learning. CoRR, abs/2004.07219, 2020. (A hedged dataset-loading sketch follows the table.) |
| Dataset Splits | No | The paper mentions using the D4RL dataset but does not provide specific details on how it was split into training, validation, and test sets for their experiments, or cite a standard split. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions using the MuJoCo simulator and various RL algorithms (PPO, CaDM, MAPLE, CQL) but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | We alter the simulator gravity to generate different dynamics in online experiments. Possible values of gravity are {1.0}, {0.7,1.0,1.3}, and {0.4,0.7,1.0,1.3,1.6} in experiments with 1, 3, and 5 kinds of different dynamics, respectively. We set ρ = 0.5 in offline experiments with medium-expert level of data. ρ = 0.2 is set in all other experiments. λ is regarded as a hyperparameter with values 0.1 or 0.3. (An illustrative configuration sketch follows the table.) |
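
The Open Datasets row quotes the paper's use of D4RL states classified by the discriminator Dδ. Below is a minimal sketch of how such states could be drawn from a D4RL dataset, assuming the standard `d4rl` package API; the task name `halfcheetah-medium-v2` and the `discriminator` placeholder are illustrative, not the paper's actual implementation.

```python
import gym
import d4rl  # importing registers the D4RL environments with gym
import numpy as np

# Load an offline dataset (example task; the paper's exact task names may differ).
env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # dict with observations, actions, rewards, terminals, ...

# Sample a batch of states from the dataset.
idx = np.random.choice(len(dataset["observations"]), size=256, replace=False)
states = dataset["observations"][idx]

def discriminator(s: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for Dδ; in SRPO this is a learned model."""
    return np.random.rand(len(s)) > 0.5

# Classify the sampled states into two sets according to the discriminator output.
mask = discriminator(states)
set_a, set_b = states[mask], states[~mask]
```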
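
The Experiment Setup row lists the gravity settings and the ρ and λ values the paper reports. The sketch below collects them into a single configuration; the dictionary keys and the `make_env_with_gravity` helper are assumed names for illustration, and only the numeric values come from the reported setup.

```python
import gym

# Numeric values are taken from the reported setup; all names are illustrative.
EXPERIMENT_CONFIG = {
    # Gravity values used to generate 1, 3, or 5 kinds of different dynamics.
    "gravity_settings": {
        1: [1.0],
        3: [0.7, 1.0, 1.3],
        5: [0.4, 0.7, 1.0, 1.3, 1.6],
    },
    # rho = 0.5 for offline medium-expert data, 0.2 in all other experiments.
    "rho": {"offline_medium_expert": 0.5, "default": 0.2},
    # lambda is treated as a hyperparameter with values 0.1 or 0.3.
    "lambda_candidates": [0.1, 0.3],
}

def make_env_with_gravity(env_name: str, gravity_scale: float):
    """Hypothetical helper: scale the MuJoCo gravity vector of a gym environment."""
    env = gym.make(env_name)
    # Assumes a mujoco-py backed environment exposing model.opt.gravity (a 3-vector).
    env.unwrapped.model.opt.gravity[:] = env.unwrapped.model.opt.gravity * gravity_scale
    return env
```

A run with three kinds of dynamics would then instantiate one environment per value in `EXPERIMENT_CONFIG["gravity_settings"][3]`.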