RAMBO-RL: Robust Adversarial Model-Based Offline Reinforcement Learning
Authors: Marc Rigter, Bruno Lacerda, Nick Hawes
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments we demonstrate that RAMBO outperforms current state-of-the-art algorithms on the D4RL benchmarks [14]. Furthermore, we provide ablation results which show that training the model adversarially is crucial to the strong performance of RAMBO. |
| Researcher Affiliation | Academia | Marc Rigter, Bruno Lacerda, Nick Hawes; Oxford Robotics Institute, University of Oxford; {mrigter, bruno, nickh}@robots.ox.ac.uk |
| Pseudocode | Yes | Algorithm 1 RAMBO-RL. Require: normalised dataset D. 1: T̂_φ ← MLE dynamics model. 2: for i = 1, 2, ..., n_iter do: 3: Generate synthetic k-step rollouts; add transition data to D_T̂φ. 4: Agent update: update π and Q^π with an actor-critic algorithm, using samples from D ∪ D_T̂φ. 5: Adversarial model update: update T̂_φ according to Eq. 9, using samples from D for the MLE component, and the current critic Q^π and synthetic data sampled from π and T̂_φ for the adversarial component. (A minimal sketch of the adversarial model update follows the table.) |
| Open Source Code | Yes | The code for our experiments is available at github.com/marc-rigter/rambo. |
| Open Datasets | Yes | We evaluate our approach on the following domains. MuJoCo: three different environments representing different robots (HalfCheetah, Hopper, Walker2D), each with 4 datasets (Random, Medium, Medium-Replay, Medium-Expert). AntMaze: the agent controls a robot and navigates to reach a goal, receiving a sparse reward only if the goal is reached. The MuJoCo and AntMaze benchmarks are from D4RL [14]. D is defined as a fixed dataset of transitions from the MDP, D = {(s_i, a_i, r_i, s'_i)}_{i=1}^{\|D\|}. (A dataset-loading snippet follows the table.) |
| Dataset Splits | Yes | Evaluation We present two different evaluations of our approach: RAMBO and RAMBO_OFF. For RAMBO, we ran each of the three hyperparameter configurations for five seeds each, and report the best performance across the three configurations. Thus, our evaluation of RAMBO utilises limited online tuning, which is the most common practice among existing model-based offline RL algorithms [25, 37, 39, 76]. The performance obtained for each of the hyperparameter configurations is included in Appendix C.2. Offline hyperparameter selection is an important topic in offline RL [47, 77]. Therefore, we present additional results for RAMBO_OFF, where we select between the three choices of hyperparameters offline using a simple heuristic (details in Appendix B.5) based on the magnitude and stability of the Q-values during offline training. (A sketch of such a heuristic follows the table.) |
| Hardware Specification | Yes | Details of compute used per run and total compute used for full evaluation is in Appendix B.8. |
| Software Dependencies | No | The paper mentions 'soft actor-critic (SAC) [19]' for agent training and 'Adam [26]' as an optimizer, but it does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Hyperparameter Details The base hyperparameters that we use for RAMBO mostly follow those used in SAC [19] and COMBO [75]. We find that the performance of RAMBO is sensitive to the choice of rollout length, k, consistent with findings in previous works [22, 37]. The other critical parameter for RAMBO is the choice of the adversarial weighting, λ. For each dataset, we choose the rollout length and the adversarial weighting from one of three possible configurations: (k, λ) ∈ {(2, 3e-4), (5, 3e-4), (5, 0)}. (The three configurations are written out as a sweep after the table.) |
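
The Algorithm 1 pseudocode above centres on the adversarial model update (step 5). Below is a minimal, runnable PyTorch sketch of that step on toy tensors: the MLE component fits the model to real transitions, and a λ-weighted term pushes the model to reduce the value the current policy obtains under it. The network shapes, the unit-variance Gaussian likelihood, and the pathwise value term are illustrative assumptions and a simplification of the paper's Eq. 9 estimator, not the authors' implementation (which is at github.com/marc-rigter/rambo).

```python
import torch
import torch.nn as nn

# Toy stand-ins for a normalised offline dataset D ("Require" of Algorithm 1).
S_DIM, A_DIM, N = 3, 2, 256
s, a = torch.randn(N, S_DIM), torch.randn(N, A_DIM)
r, s_next = torch.randn(N, 1), torch.randn(N, S_DIM)

# Dynamics model T̂_φ predicts the mean of (s', r); actor and critic stand in for the SAC agent.
model = nn.Sequential(nn.Linear(S_DIM + A_DIM, 64), nn.ReLU(), nn.Linear(64, S_DIM + 1))
critic = nn.Sequential(nn.Linear(S_DIM + A_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(S_DIM, 64), nn.ReLU(), nn.Linear(64, A_DIM))
actor.requires_grad_(False)   # the agent is treated as fixed during the model update
critic.requires_grad_(False)
model_opt = torch.optim.Adam(model.parameters(), lr=3e-4)

LAMBDA = 3e-4  # adversarial weighting λ (one of the paper's three configurations)

for step in range(10):
    # MLE component: unit-variance Gaussian NLL on real transitions from D.
    pred = model(torch.cat([s, a], dim=-1))
    mle_loss = 0.5 * ((pred - torch.cat([s_next, r], dim=-1)) ** 2).sum(-1).mean()

    # Adversarial component (simplified stand-in for Eq. 9): predict synthetic
    # transitions under the current policy and minimise the value the policy
    # would obtain from them, making the model pessimistic where the policy goes.
    a_pi = actor(s)
    synth = model(torch.cat([s, a_pi], dim=-1))
    s_synth, r_synth = synth[:, :S_DIM], synth[:, S_DIM:]
    a_next = actor(s_synth)
    value_term = (r_synth + 0.99 * critic(torch.cat([s_synth, a_next], dim=-1))).mean()

    loss = mle_loss + LAMBDA * value_term
    model_opt.zero_grad()
    loss.backward()
    model_opt.step()
    # Steps 3-4 of Algorithm 1 (k-step rollouts into a synthetic buffer and a SAC
    # update on D ∪ D_T̂φ) would run alongside this update in the full algorithm.
```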
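
The MuJoCo and AntMaze datasets referenced above are distributed through the D4RL package. As a small illustration of the dataset definition D = {(s_i, a_i, r_i, s'_i)}, the snippet below loads one such dataset; the dataset name and version suffix are examples and depend on the installed D4RL release.

```python
import gym
import d4rl  # registers the D4RL offline datasets with gym

env = gym.make("hopper-medium-v2")   # example dataset name; version suffix varies by release
data = d4rl.qlearning_dataset(env)   # aligned (s, a, r, s') arrays plus terminal flags

# D = {(s_i, a_i, r_i, s'_i)}_{i=1}^{|D|}
states, actions = data["observations"], data["actions"]
rewards, next_states = data["rewards"], data["next_observations"]
print(states.shape, actions.shape, rewards.shape, next_states.shape)
```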
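
The offline variant RAMBO_OFF selects among the three hyperparameter configurations using a heuristic based on the magnitude and stability of the Q-values during offline training (Appendix B.5 of the paper). The exact rule is not given in the excerpt, so the function below is only a hypothetical illustration of that kind of heuristic: it discards configurations whose Q-values blow up and prefers the one with the most stable late-training Q-values; the threshold and scoring are assumptions.

```python
import numpy as np

def select_config_offline(q_histories, q_magnitude_limit=1e3):
    """Hypothetical Q-value-based offline selection among hyperparameter configs.

    q_histories maps a config label to the average Q-values logged during
    offline training. The paper's actual heuristic is in Appendix B.5; the
    magnitude limit and stability score here are illustrative assumptions.
    """
    scores = {}
    for cfg, qs in q_histories.items():
        qs = np.asarray(qs, dtype=float)
        if np.abs(qs).max() > q_magnitude_limit:   # reject diverging Q-values
            continue
        scores[cfg] = np.std(qs[len(qs) // 2:])    # stability of late-training Q-values
    return min(scores, key=scores.get) if scores else None

# Example with three logged training curves (synthetic data for illustration):
histories = {f"config_{i}": np.random.randn(100).cumsum() for i in range(3)}
print(select_config_offline(histories))
```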
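
The experiment-setup excerpt specifies the search space exactly: three (k, λ) configurations, each run for five seeds per dataset (per the evaluation row). Written out as a simple sweep definition:

```python
# The three (rollout length k, adversarial weight λ) configurations used for RAMBO.
CONFIGS = [
    {"rollout_length": 2, "adv_weight": 3e-4},
    {"rollout_length": 5, "adv_weight": 3e-4},
    {"rollout_length": 5, "adv_weight": 0.0},
]
SEEDS = range(5)  # five seeds per configuration, as reported in the evaluation

runs = [(cfg, seed) for cfg in CONFIGS for seed in SEEDS]
print(len(runs), "runs per dataset")  # 15
```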