Robust Reinforcement Learning using Offline Data
Authors: Kishan Panaganti, Zaiyan Xu, Dileep Kalathil, Mohammad Ghavamzadeh
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we propose a robust RL algorithm called Robust Fitted Q-Iteration (RFQI)... We prove that RFQI learns a near-optimal robust policy under standard assumptions and demonstrate its superior performance on standard benchmark problems. |
| Researcher Affiliation | Collaboration | Kishan Panaganti^1, Zaiyan Xu^1, Dileep Kalathil^1, Mohammad Ghavamzadeh^2 (^1 Texas A&M University, ^2 Google Research). Emails: {kpb, zxu43, dileep.kalathil}@tamu.edu, ghavamza@google.com |
| Pseudocode | Yes | Algorithm 1 Robust Fitted Q-Iteration (RFQI) Algorithm (a hedged sketch of this loop appears below the table) |
| Open Source Code | Yes | We provide our code on the GitHub page https://github.com/zaiyan-x/RFQI, containing instructions to reproduce all results in this paper. |
| Open Datasets | Yes | Here, we demonstrate the robust performance of our RFQI algorithm by evaluating it on the Cartpole and Hopper environments in OpenAI Gym (Brockman et al., 2016). |
| Dataset Splits | No | The paper states it uses an 'offline dataset D' but does not provide specific details on how this dataset is split into training, validation, or test sets, nor does it specify any cross-validation setup. |
| Hardware Specification | Yes | All experiments were performed on a single machine with NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions using the 'stable-baselines3 library' but does not provide specific version numbers for this or any other software dependencies like Python or deep learning frameworks. |
| Experiment Setup | Yes | We use a feedforward neural network with 2 hidden layers and 256 neurons in each layer. We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 0.0003 and a batch size of 256. For the Cartpole, we train for 1000 epochs. For the Hopper, we train for 100 epochs. We use the discount factor γ = 0.99 for all experiments. The radius of the uncertainty set is set to ρ = 0.1 for Cartpole experiments and ρ = 0.5 for Hopper experiments. (These values are collected in the PyTorch configuration sketch below the table.) |
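
The pseudocode row above refers to Algorithm 1 (RFQI). A minimal sketch of that fitted-Q loop follows, under assumptions stated in the comments: a grid-searched global dual variable stands in for the paper's learned dual function g, a scikit-learn regressor stands in for its function class, and the TV-distance dual form shown is one standard reformulation that may differ from the paper's exact operator.

```python
# Minimal sketch of the RFQI loop, not the paper's implementation. Assumptions:
# rewards in [0, 1], a random-forest regressor standing in for the paper's
# function class, and one global dual variable eta picked by grid search where
# the paper learns a state-action-dependent dual function g(s, a). The robust
# target uses a standard dual form for TV-distance uncertainty sets,
#   inf_{P: TV(P, P0) <= rho} E_P[V] = max_eta E_P0[min(V, eta)] - rho*(eta - min V),
# which may differ from the paper's exact reformulation.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rfqi(states, actions, rewards, next_states, n_actions,
         gamma=0.99, rho=0.1, n_iters=50):
    """Robust fitted Q-iteration over an offline transition dataset."""
    q = RandomForestRegressor(n_estimators=50)
    eta_grid = np.linspace(0.0, 1.0 / (1.0 - gamma), 21)  # dual-variable grid
    v_next = np.zeros(len(next_states))                   # V_0 = 0
    for it in range(n_iters):
        if it > 0:
            # Greedy next-state values under the current Q estimate.
            q_next = np.stack(
                [q.predict(np.column_stack(
                    [next_states, np.full(len(next_states), b)]))
                 for b in range(n_actions)], axis=1)
            v_next = q_next.max(axis=1)
        # Grid-search one eta maximizing the dataset-average dual objective.
        v_min = v_next.min()
        dual = [np.minimum(v_next, eta).mean() - rho * (eta - v_min)
                for eta in eta_grid]
        eta = eta_grid[int(np.argmax(dual))]
        # Per-sample robust regression target, then one fitted-Q regression step.
        y = rewards + gamma * (np.minimum(v_next, eta) - rho * (eta - v_min))
        q.fit(np.column_stack([states, actions]), y)
    return q
```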
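
The experiment-setup row reports concrete training hyperparameters; the sketch below collects them in PyTorch. The `QNetwork` name, ReLU activations, and CartPole dimensions (`obs_dim=4`, `n_actions=2`) are illustrative assumptions; the layer widths, optimizer, learning rate, batch size, discount factor, and uncertainty radii are the values quoted above.

```python
# Sketch collecting the reported hyperparameters in PyTorch. Only the layer
# widths (2 x 256), Adam with lr 3e-4, batch size 256, gamma = 0.99, and the
# rho values come from the paper; everything else here is an assumption.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, obs_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs)

q_net = QNetwork(obs_dim=4, n_actions=2)  # CartPole-sized network (assumed)
optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4)
batch_size = 256
gamma = 0.99   # discount factor, all experiments
rho = 0.1      # uncertainty-set radius for Cartpole (0.5 for Hopper)
```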