The Edge-of-Reach Problem in Offline Model-Based Reinforcement Learning
Authors: Anya Sims, Cong Lu, Jakob Foerster, Yee Whye Teh
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we analyze this assumption and investigate how popular algorithms perform as the learned dynamics model is improved. We term this the edge-of-reach problem. Based on this new insight, we fill important gaps in existing theory, and reveal how prior model-based methods are primarily addressing the edge-of-reach problem, rather than model inaccuracy as claimed. Finally, we propose Reach-Aware Value Learning (RAVL), a simple and robust method that directly addresses the edge-of-reach problem and hence, unlike existing methods, does not fail as the dynamics model is improved. We find that existing offline model-based methods completely fail if the learned dynamics model is replaced with the true, error-free dynamics model, while keeping everything else the same (see Figure 1). In this section, we begin by analyzing RAVL on the simple environment from Section 4. Next, we look at the standard D4RL benchmark, first confirming that RAVL solves the failure seen with the true dynamics, before then demonstrating that RAVL achieves strong performance with the learned dynamics. |
| Researcher Affiliation | Academia | Anya Sims (University of Oxford, anya.sims@stats.ox.ac.uk); Cong Lu (University of Oxford); Jakob N. Foerster (FLAIR, University of Oxford); Yee Whye Teh (University of Oxford) |
| Pseudocode | Yes | Algorithm 1: Base model-based algorithm (MBPO) + additions in existing methods and RAVL (ours). (A minimal structural sketch of this loop is given after the table.) |
| Open Source Code | Yes | Our code is open-sourced at: github.com/anyasims/edge-of-reach. We have open-sourced our code at https://anonymous.4open.science/r/edge-of-reach-8096 |
| Open Datasets | Yes | Results shown are for MOPO [36], but note that this failure indicates the failure of all existing uncertainty-based methods, since each of their specific penalty terms disappears under the true dynamics as uncertainty is zero. By contrast, our method is much more robust to changes in the dynamics model. The x-axis shows linearly interpolating the next states and rewards of the learned model with the true model (center right) and a random model (center left), with results on the D4RL W2d-medexp benchmark (min/max over 4 seeds); a sketch of this interpolation is given after the table. Table 1 shows results on the standard offline benchmark D4RL [7] MuJoCo [34] v2 datasets with the true (zero error, zero uncertainty) dynamics. |
| Dataset Splits | No | The paper uses standard D4RL datasets for its experiments, which typically come with predefined splits. However, the paper does not explicitly state the exact train/validation/test splits (e.g., percentages, sample counts, or specific split files) used for its own experimental setup within the main text or appendices. |
| Hardware Specification | Yes | Our algorithm takes on average 6 hours to run using a V100 GPU for the full number of epochs. |
| Software Dependencies | No | The paper mentions using a 'base model-based procedure' and that 'Our implementation is based on the Clean Offline Reinforcement Learning (CORL, Tarasov et al. [33]) repository', but it does not provide specific version numbers for key software components or libraries (e.g., PyTorch, TensorFlow, specific Python libraries) that would enable precise replication of the software environment. |
| Experiment Setup | Yes | For the D4RL [7] MuJoCo results presented in Table 2, we sweep over the following hyperparameters and list the choices used in Table 4: (EDAC) number of Q-ensemble elements Ncritic, in the range {10, 50}; (EDAC) ensemble diversity weight η, in the range {1, 10, 100}; (Base) model rollout length k, in the range {1, 5}; (Base) real-to-synthetic data ratio r, in the range {0.05, 0.5}. The remaining model-based and agent hyperparameters are given in Table 5. Table 5 (fixed hyperparameters for RAVL used in D4RL MuJoCo locomotion tasks): epochs 3,000 for medexp, 1,000 for rest; gamma 0.99; learning rate 3e-4; batch size 256; buffer retain epochs 5; number of rollouts 50,000. (The sweep is written out as an explicit grid in the sketch after the table.) |
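
To make the Algorithm 1 row above concrete, the following is a minimal sketch of the MBPO-style base loop with the RAVL addition (pessimism from a Q-ensemble rather than an uncertainty-penalised reward). All names here (`rollout_model`, `ravl_target`, the toy dynamics, policy, and critics) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Algorithm 1 structure: MBPO-style model rollouts plus
# a RAVL-style pessimistic value target from a Q-ensemble (illustrative only).
import numpy as np

def rollout_model(dynamics, policy, start_states, k):
    """Branch k-step synthetic rollouts from states sampled out of the offline dataset (as in MBPO)."""
    transitions, s = [], start_states
    for _ in range(k):
        a = policy(s)
        s_next, r = dynamics(s, a)            # learned (or true) dynamics model
        transitions.append((s, a, r, s_next))
        s = s_next
    return transitions

def ravl_target(q_ensemble, s_next, a_next, r, gamma=0.99):
    """Pessimistic Bellman target: minimum over an ensemble of critics, so
    state-actions the critics disagree on (e.g. edge-of-reach states) get low values."""
    q_next = np.stack([q(s_next, a_next) for q in q_ensemble])   # (N_critic, batch)
    return r + gamma * q_next.min(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for the learned dynamics, the policy, and the critic ensemble.
    dynamics = lambda s, a: (s + 0.1 * a, -np.linalg.norm(a, axis=-1))
    policy = lambda s: rng.normal(size=s.shape)
    critics = [lambda s, a, w=rng.normal(): (s * a).sum(-1) + w for _ in range(10)]

    start_states = rng.normal(size=(256, 3))        # states drawn from the dataset
    rollouts = rollout_model(dynamics, policy, start_states, k=5)
    s, a, r, s_next = rollouts[-1]
    targets = ravl_target(critics, s_next, policy(s_next), r)
    print(targets.shape)                            # -> (256,)
```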
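
The dynamics-interpolation probe mentioned in the Open Datasets row can be sketched as below: next states and rewards are linearly interpolated between the learned model and the true (or random) dynamics, while everything else is kept the same. The names `learned_step` and `true_step` and the toy usage are assumptions for illustration, not the paper's code.

```python
# Sketch of linearly interpolating a learned dynamics model with the true
# (or a random) model, as on the x-axis of the interpolation experiment.
import numpy as np

def interpolated_step(learned_step, true_step, s, a, alpha):
    """alpha = 0 -> purely learned model; alpha = 1 -> purely true (or random) dynamics."""
    s_learned, r_learned = learned_step(s, a)
    s_true, r_true = true_step(s, a)
    s_next = (1.0 - alpha) * s_learned + alpha * s_true
    r = (1.0 - alpha) * r_learned + alpha * r_true
    return s_next, r

if __name__ == "__main__":
    # Toy dynamics standing in for the learned and true models.
    learned_step = lambda s, a: (s + 1.1 * a, np.zeros_like(s[..., 0]))
    true_step = lambda s, a: (s + 1.0 * a, np.ones_like(s[..., 0]))
    s, a = np.zeros((4, 3)), np.ones((4, 3))
    for alpha in (0.0, 0.5, 1.0):
        s_next, r = interpolated_step(learned_step, true_step, s, a, alpha)
        print(alpha, s_next[0, 0], r[0])
```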
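
Finally, the hyperparameter sweep from the Experiment Setup row written out as an explicit grid. The dictionary keys are descriptive labels chosen here for readability, not the repository's actual config field names.

```python
# Sketch of the RAVL hyperparameter sweep (Table 4 ranges) and fixed settings (Table 5).
from itertools import product

sweep = {
    "num_critics": [10, 50],          # (EDAC) number of Q-ensemble elements
    "eta": [1, 10, 100],              # (EDAC) ensemble diversity weight
    "rollout_length_k": [1, 5],       # (Base) model rollout length
    "real_ratio_r": [0.05, 0.5],      # (Base) real-to-synthetic data ratio
}

fixed = {
    "gamma": 0.99,
    "learning_rate": 3e-4,
    "batch_size": 256,
    "buffer_retain_epochs": 5,
    "num_rollouts": 50_000,
    # epochs: 3,000 for medexp datasets, 1,000 for the rest
}

configs = [dict(zip(sweep, values)) for values in product(*sweep.values())]
print(len(configs))  # 2 * 3 * 2 * 2 = 24 sweep configurations
```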