RAMBO-RL: Robust Adversarial Model-Based Offline Reinforcement Learning
Authors: Marc Rigter, Bruno Lacerda, Nick Hawes
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments we demonstrate that RAMBO outperforms current state-of-the-art algorithms on the D4RL benchmarks [14]. Furthermore, we provide ablation results which show that training the model adversarially is crucial to the strong performance of RAMBO. |
| Researcher Affiliation | Academia | Marc Rigter, Bruno Lacerda, Nick Hawes; Oxford Robotics Institute, University of Oxford; {mrigter, bruno, nickh}@robots.ox.ac.uk |
| Pseudocode | Yes | Algorithm 1 RAMBO-RL. Require: normalised dataset D. 1: T̂_φ ← MLE dynamics model. 2: for i = 1, 2, ..., n_iter do: 3: Generate synthetic k-step rollouts; add transition data to D_T̂φ. 4: Agent update: update π and Q^π with an actor-critic algorithm, using samples from D ∪ D_T̂φ. 5: Adversarial model update: update T̂_φ according to Eq. 9, using samples from D for the MLE component, and the current critic Q^π and synthetic data sampled from π and T̂_φ for the adversarial component. (A minimal sketch of the adversarial model update follows the table.) |
| Open Source Code | Yes | The code for our experiments is available at github.com/marc-rigter/rambo. |
| Open Datasets | Yes | We evaluate our approach on the following domains. MuJoCo: three different environments representing different robots (HalfCheetah, Hopper, Walker2D), each with 4 datasets (Random, Medium, Medium-Replay, Medium-Expert). AntMaze: the agent controls a robot and navigates to reach a goal, receiving a sparse reward only if the goal is reached. The MuJoCo and AntMaze benchmarks are from D4RL [14]. D is defined as a fixed dataset of transitions from the MDP, D = {(s_i, a_i, r_i, s'_i)}_{i=1}^{\|D\|}. (A dataset-loading snippet follows the table.) |
| Dataset Splits | Yes | Evaluation We present two different evaluations of our approach: RAMBO and RAMBO_OFF. For RAMBO, we ran each of the three hyperparameter configurations for five seeds each, and report the best performance across the three configurations. Thus, our evaluation of RAMBO utilises limited online tuning, which is the most common practice among existing model-based offline RL algorithms [25, 37, 39, 76]. The performance obtained for each of the hyperparameter configurations is included in Appendix C.2. Offline hyperparameter selection is an important topic in offline RL [47, 77]. Therefore, we present additional results for RAMBO_OFF, where we select between the three choices of hyperparameters offline using a simple heuristic (details in Appendix B.5) based on the magnitude and stability of the Q-values during offline training. (A sketch of such a heuristic follows the table.) |
| Hardware Specification | Yes | Details of compute used per run and total compute used for full evaluation is in Appendix B.8. |
| Software Dependencies | No | The paper mentions 'soft actor-critic (SAC) [19]' for agent training and 'Adam [26]' as an optimizer, but it does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Hyperparameter Details The base hyperparameters that we use for RAMBO mostly follow those used in SAC [19] and COMBO [75]. We find that the performance of RAMBO is sensitive to the choice of rollout length, k, consistent with findings in previous works [22, 37]. The other critical parameter for RAMBO is the choice of the adversarial weighting, λ. For each dataset, we choose the rollout length and the adversarial weighting from one of three possible configurations: (k, λ) ∈ {(2, 3e-4), (5, 3e-4), (5, 0)}. (The three configurations are written out as a sweep after the table.) |
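
The Algorithm 1 pseudocode above centres on the adversarial model update (step 5). Below is a minimal, runnable PyTorch sketch of that step on toy tensors: the MLE component fits the model to real transitions, and a λ-weighted term pushes the model to reduce the value the current policy obtains under it. The network shapes, the unit-variance Gaussian likelihood, and the pathwise value term are illustrative assumptions and a simplification of the paper's Eq. 9 estimator, not the authors' implementation (which is at github.com/marc-rigter/rambo).

```python
import torch
import torch.nn as nn

# Toy stand-ins for a normalised offline dataset D ("Require" of Algorithm 1).
S_DIM, A_DIM, N = 3, 2, 256
s, a = torch.randn(N, S_DIM), torch.randn(N, A_DIM)
r, s_next = torch.randn(N, 1), torch.randn(N, S_DIM)

# Dynamics model T̂_φ predicts the mean of (s', r); actor and critic stand in for the SAC agent.
model = nn.Sequential(nn.Linear(S_DIM + A_DIM, 64), nn.ReLU(), nn.Linear(64, S_DIM + 1))
critic = nn.Sequential(nn.Linear(S_DIM + A_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(S_DIM, 64), nn.ReLU(), nn.Linear(64, A_DIM))
actor.requires_grad_(False)   # the agent is treated as fixed during the model update
critic.requires_grad_(False)
model_opt = torch.optim.Adam(model.parameters(), lr=3e-4)

LAMBDA = 3e-4  # adversarial weighting λ (one of the paper's three configurations)

for step in range(10):
    # MLE component: unit-variance Gaussian NLL on real transitions from D.
    pred = model(torch.cat([s, a], dim=-1))
    mle_loss = 0.5 * ((pred - torch.cat([s_next, r], dim=-1)) ** 2).sum(-1).mean()

    # Adversarial component (simplified stand-in for Eq. 9): predict synthetic
    # transitions under the current policy and minimise the value the policy
    # would obtain from them, making the model pessimistic where the policy goes.
    a_pi = actor(s)
    synth = model(torch.cat([s, a_pi], dim=-1))
    s_synth, r_synth = synth[:, :S_DIM], synth[:, S_DIM:]
    a_next = actor(s_synth)
    value_term = (r_synth + 0.99 * critic(torch.cat([s_synth, a_next], dim=-1))).mean()

    loss = mle_loss + LAMBDA * value_term
    model_opt.zero_grad()
    loss.backward()
    model_opt.step()
    # Steps 3-4 of Algorithm 1 (k-step rollouts into a synthetic buffer and a SAC
    # update on D ∪ D_T̂φ) would run alongside this update in the full algorithm.
```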
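
The MuJoCo and AntMaze datasets referenced above are distributed through the D4RL package. As a small illustration of the dataset definition D = {(s_i, a_i, r_i, s'_i)}, the snippet below loads one such dataset; the dataset name and version suffix are examples and depend on the installed D4RL release.

```python
import gym
import d4rl  # registers the D4RL offline datasets with gym

env = gym.make("hopper-medium-v2")   # example dataset name; version suffix varies by release
data = d4rl.qlearning_dataset(env)   # aligned (s, a, r, s') arrays plus terminal flags

# D = {(s_i, a_i, r_i, s'_i)}_{i=1}^{|D|}
states, actions = data["observations"], data["actions"]
rewards, next_states = data["rewards"], data["next_observations"]
print(states.shape, actions.shape, rewards.shape, next_states.shape)
```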
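
The offline variant RAMBO_OFF selects among the three hyperparameter configurations using a heuristic based on the magnitude and stability of the Q-values during offline training (Appendix B.5 of the paper). The exact rule is not given in the excerpt, so the function below is only a hypothetical illustration of that kind of heuristic: it discards configurations whose Q-values blow up and prefers the one with the most stable late-training Q-values; the threshold and scoring are assumptions.

```python
import numpy as np

def select_config_offline(q_histories, q_magnitude_limit=1e3):
    """Hypothetical Q-value-based offline selection among hyperparameter configs.

    q_histories maps a config label to the average Q-values logged during
    offline training. The paper's actual heuristic is in Appendix B.5; the
    magnitude limit and stability score here are illustrative assumptions.
    """
    scores = {}
    for cfg, qs in q_histories.items():
        qs = np.asarray(qs, dtype=float)
        if np.abs(qs).max() > q_magnitude_limit:   # reject diverging Q-values
            continue
        scores[cfg] = np.std(qs[len(qs) // 2:])    # stability of late-training Q-values
    return min(scores, key=scores.get) if scores else None

# Example with three logged training curves (synthetic data for illustration):
histories = {f"config_{i}": np.random.randn(100).cumsum() for i in range(3)}
print(select_config_offline(histories))
```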
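
The experiment-setup excerpt specifies the search space exactly: three (k, λ) configurations, each run for five seeds per dataset (per the evaluation row). Written out as a simple sweep definition:

```python
# The three (rollout length k, adversarial weight λ) configurations used for RAMBO.
CONFIGS = [
    {"rollout_length": 2, "adv_weight": 3e-4},
    {"rollout_length": 5, "adv_weight": 3e-4},
    {"rollout_length": 5, "adv_weight": 0.0},
]
SEEDS = range(5)  # five seeds per configuration, as reported in the evaluation

runs = [(cfg, seed) for cfg in CONFIGS for seed in SEEDS]
print(len(runs), "runs per dataset")  # 15
```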