RORL: Robust Offline Reinforcement Learning via Conservative Smoothing

Authors: Rui Yang, Chenjia Bai, Xiaoteng Ma, Zhaoran Wang, Chongjie Zhang, Lei Han

NeurIPS 2022

Reproducibility Variables, Results, and LLM Responses:
Research Type: Experimental
LLM Response: "We demonstrate that RORL can achieve state-of-the-art performance on the general offline RL benchmark and is considerably robust to adversarial observation perturbations." "In our experiments, we demonstrate that RORL can achieve state-of-the-art (SOTA) performance in the D4RL benchmark [12] with fewer ensemble Q networks than the current SOTA approach [2]."
Researcher Affiliation: Collaboration
LLM Response: "Rui Yang¹, Chenjia Bai², Xiaoteng Ma³, Zhaoran Wang⁴, Chongjie Zhang³, Lei Han⁵ — ¹Hong Kong University of Science and Technology, ²Shanghai AI Laboratory, ³Tsinghua University, ⁴Northwestern University, ⁵Tencent Robotics X"
Pseudocode: Yes
LLM Response: "Figure 3: RORL Algorithm. RORL trains multiple Q-functions for uncertainty quantification. The conservative smoothing loss is calculated for (ŝ, a) with perturbed states. We perform uncertainty penalization for (ŝ, â) with perturbed states and OOD actions." "Algorithm 1: RORL Algorithm"
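To make the quoted mechanism concrete, here is a minimal PyTorch sketch of ensemble-based uncertainty penalization at perturbed state-action pairs. This is not the authors' implementation: the names `perturb_state` and `penalized_target`, the uniform L∞ perturbation sampling, and the mean-minus-std penalty form are all illustrative assumptions.

```python
# Hypothetical sketch, not the authors' code: penalize target Q-values at
# perturbed pairs (s_hat, a_hat) by the ensemble's disagreement.
import torch
import torch.nn as nn

def perturb_state(s: torch.Tensor, eps: float) -> torch.Tensor:
    """Sample a perturbed state s_hat uniformly within an L-inf ball of
    radius eps (one simple choice; the paper's scheme may differ)."""
    noise = (torch.rand_like(s) * 2.0 - 1.0) * eps
    return s + noise

def penalized_target(q_ensemble: nn.ModuleList, s_hat: torch.Tensor,
                     a_hat: torch.Tensor, beta: float) -> torch.Tensor:
    """Mean ensemble Q minus beta times the ensemble standard deviation,
    a common proxy for epistemic uncertainty at (s_hat, a_hat)."""
    sa = torch.cat([s_hat, a_hat], dim=-1)
    qs = torch.stack([q(sa) for q in q_ensemble], dim=0)  # (N, batch, 1)
    return qs.mean(dim=0) - beta * qs.std(dim=0)
```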
Open Source Code: Yes
LLM Response: "Our code is available at https://github.com/YangRui2015/RORL"
Open Datasets: Yes
LLM Response: "We evaluate our method on the D4RL benchmark [12] with various continuous-control tasks and datasets." ... "We cited D4RL [12] and EDAC [2] for their datasets and code."
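Since the cited datasets are public, a sketch like the following would load one of them with the standard D4RL API; the specific task string is an illustrative assumption, not one quoted above.

```python
# Minimal D4RL loading sketch; any registered task name could be used in
# place of "halfcheetah-medium-v2", which is assumed here for illustration.
import gym
import d4rl  # noqa: F401 -- importing registers the D4RL environments

env = gym.make("halfcheetah-medium-v2")
dataset = d4rl.qlearning_dataset(env)  # dict of observations, actions, rewards, ...
print(dataset["observations"].shape, dataset["actions"].shape)
```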
Dataset Splits: No
LLM Response: The paper evaluates on D4RL datasets and refers to an appendix for further hyperparameters and implementation details, but it does not explicitly state dataset splits (e.g., percentages or counts for train, validation, and test sets) in the provided text.
Hardware Specification: Yes
LLM Response: "We compare the computational cost of RORL with prior works on a single machine with one GPU (Tesla V100 32G)."
Software Dependencies: No
LLM Response: The paper mentions using a "2-layer MLP" and the "Adam optimizer" but does not specify version numbers for any software libraries, frameworks, or programming languages used.
Experiment Setup: Yes
LLM Response: "For benchmark experiments, we set small perturbation scales ε_P, ε_Q, and ε_ood within {0.001, 0.005, 0.01} when training RORL and do not include observation perturbation at testing time. We set the learning rates to 3e-4. The network structures are 2-layer MLPs with 256 hidden units. We use the Adam optimizer for all networks."
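A minimal sketch of the quoted network and optimizer setup, assuming "2-layer MLP with 256 hidden units" means two hidden layers; the input and output dimensions are placeholders, not values from the excerpt.

```python
# Sketch of the quoted setup; dimensions below are placeholder assumptions.
import torch
import torch.nn as nn

def make_mlp(in_dim: int, out_dim: int, hidden: int = 256) -> nn.Sequential:
    """Two hidden layers of 256 units each, per one reading of the quote."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

# Example: a Q-network over concatenated (state, action) inputs.
q_net = make_mlp(in_dim=23, out_dim=1)
optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)  # "learning rates to 3e-4"
```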