RORL: Robust Offline Reinforcement Learning via Conservative Smoothing
Authors: Rui Yang, Chenjia Bai, Xiaoteng Ma, Zhaoran Wang, Chongjie Zhang, Lei Han
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that RORL can achieve state-of-the-art performance on the general offline RL benchmark and is considerably robust to adversarial observation perturbations. In our experiments (Section 3), we demonstrate that RORL can achieve state-of-the-art (SOTA) performance on the D4RL benchmark [12] with fewer ensemble Q networks than the current SOTA approach [2]. |
| Researcher Affiliation | Collaboration | Rui Yang¹, Chenjia Bai², Xiaoteng Ma³, Zhaoran Wang⁴, Chongjie Zhang³, Lei Han⁵; ¹Hong Kong University of Science and Technology, ²Shanghai AI Laboratory, ³Tsinghua University, ⁴Northwestern University, ⁵Tencent Robotics X |
| Pseudocode | Yes | Figure 3: RORL Algorithm: RORL trains multiple Q-functions for uncertainty quantification. The conservative smoothing loss is calculated for (ŝ, a) with perturbed states. We perform uncertainty penalization for (ŝ, â) with perturbed states and OOD actions. Algorithm 1: RORL Algorithm. (A hedged code sketch of these components follows the table.) |
| Open Source Code | Yes | Our code is available at https://github.com/YangRui2015/RORL |
| Open Datasets | Yes | We evaluate our method on the D4RL benchmark [12] with various continuous-control tasks and datasets. ...We cited D4RL [12] and EDAC [2] for their datasets and code. |
| Dataset Splits | No | The paper evaluates on D4RL datasets and refers to an appendix for more hyper-parameters and implementation details, but it does not explicitly state dataset splits (e.g., percentages or counts for train, validation, and test sets) in the provided text. |
| Hardware Specification | Yes | We compare the computational cost of RORL with prior works on a single machine with one GPU (Tesla V100 32G). |
| Software Dependencies | No | The paper mentions using a '2-layer MLP' and 'Adam optimizer' but does not specify version numbers for any software libraries, frameworks, or programming languages used. |
| Experiment Setup | Yes | For benchmark experiments, we set small perturbation scales ε_P, ε_Q, and ε_ood within {0.001, 0.005, 0.01} when training RORL and do not include observation perturbation at testing time. We set the learning rates to 3e-4. The network structures are 2-layer MLPs with 256 hidden units. We use the Adam optimizer for all networks. (A training-step sketch using these settings follows the table.) |
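
To make the Pseudocode row concrete, below is a minimal PyTorch sketch of the components it describes: an ensemble of Q-networks for uncertainty quantification, a conservative smoothing loss over perturbed states ŝ, and an uncertainty penalty at perturbed states with OOD actions. This is a hedged reconstruction from the table's description, not the authors' implementation (see https://github.com/YangRui2015/RORL for that): `num_qs`, `beta`, the squared-difference smoothing form, and the use of random rather than adversarially chosen perturbations are all simplifying assumptions.

```python
import torch
import torch.nn as nn


class QEnsemble(nn.Module):
    """Ensemble of Q-networks (2-layer MLPs) used for uncertainty quantification."""

    def __init__(self, state_dim, action_dim, num_qs=10, hidden=256):
        super().__init__()
        self.qs = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(num_qs)
        ])

    def forward(self, state, action):
        x = torch.cat([state, action], dim=-1)
        # Stack per-network outputs into shape (batch, num_qs).
        return torch.cat([q(x) for q in self.qs], dim=-1)


def perturb(state, eps):
    """Random L_inf-ball perturbation of the state. The paper uses adversarially
    chosen perturbations; uniform noise is a simplification for this sketch."""
    return state + eps * (2 * torch.rand_like(state) - 1)


def smoothing_loss(ensemble, state, action, eps_q):
    """Conservative smoothing: keep Q(ŝ, a) close to Q(s, a) for perturbed ŝ."""
    q_clean = ensemble(state, action)
    q_hat = ensemble(perturb(state, eps_q), action)
    return ((q_hat - q_clean) ** 2).mean()


def uncertainty_penalty(ensemble, state, ood_action, eps_ood, beta=1.0):
    """Penalize ensemble disagreement (std) at perturbed states (ŝ, â) with OOD actions."""
    q_ood = ensemble(perturb(state, eps_ood), ood_action)
    return beta * q_ood.std(dim=-1).mean()
```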
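
The Experiment Setup row reports 2-layer MLPs with 256 hidden units, Adam with learning rate 3e-4, and perturbation scales drawn from {0.001, 0.005, 0.01}. A single training step under those settings might look like the following, reusing `QEnsemble`, `smoothing_loss`, and `uncertainty_penalty` from the sketch above. The state/action dimensions and the uniform-action OOD proxy are illustrative assumptions, and the Bellman/TD terms of the full RORL objective are omitted for brevity.

```python
import torch

# Dimensions are illustrative (e.g., HalfCheetah: 17-dim observations, 6-dim actions).
state_dim, action_dim = 17, 6
ensemble = QEnsemble(state_dim, action_dim, num_qs=10, hidden=256)  # 2-layer MLP, 256 units
optimizer = torch.optim.Adam(ensemble.parameters(), lr=3e-4)        # Adam, lr 3e-4 per the table

states = torch.randn(256, state_dim)             # stand-in for a D4RL minibatch
actions = torch.randn(256, action_dim)
ood_actions = 2 * torch.rand_like(actions) - 1   # uniform actions as a simple OOD proxy

# Perturbation scales taken from the reported set {0.001, 0.005, 0.01}.
loss = (smoothing_loss(ensemble, states, actions, eps_q=0.005)
        + uncertainty_penalty(ensemble, states, ood_actions, eps_ood=0.01))
optimizer.zero_grad()
loss.backward()
optimizer.step()
```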