The Edge-of-Reach Problem in Offline Model-Based Reinforcement Learning
Authors: Anya Sims, Cong Lu, Jakob Foerster, Yee Whye Teh
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we analyze this assumption and investigate how popular algorithms perform as the learned dynamics model is improved. We term this the edge-of-reach problem. Based on this new insight, we fill important gaps in existing theory, and reveal how prior model-based methods are primarily addressing the edge-of-reach problem, rather than model inaccuracy as claimed. Finally, we propose Reach-Aware Value Learning (RAVL), a simple and robust method that directly addresses the edge-of-reach problem and hence, unlike existing methods, does not fail as the dynamics model is improved. We find that existing offline model-based methods completely fail if the learned dynamics model is replaced with the true, error-free dynamics model, while keeping everything else the same (see Figure 1). In this section, we begin by analyzing RAVL on the simple environment from Section 4. Next, we look at the standard D4RL benchmark, first confirming that RAVL solves the failure seen with the true dynamics, before then demonstrating that RAVL achieves strong performance with the learned dynamics. |
| Researcher Affiliation | Academia | Anya Sims (University of Oxford, anya.sims@stats.ox.ac.uk); Cong Lu (University of Oxford); Jakob N. Foerster (FLAIR, University of Oxford); Yee Whye Teh (University of Oxford) |
| Pseudocode | Yes | Algorithm 1: Base model-based algorithm (MBPO) + additions in existing methods and RAVL (ours). (A minimal structural sketch of this loop is given after the table.) |
| Open Source Code | Yes | Our code is open-sourced at: github.com/anyasims/edge-of-reach. We have open-sourced our code at https://anonymous.4open.science/r/edge-of-reach-8096 |
| Open Datasets | Yes | Results shown are for MOPO [36], but note that this failure indicates the failure of all existing uncertainty-based methods, since each of their specific penalty terms disappears under the true dynamics as uncertainty is zero. By contrast, our method is much more robust to changes in the dynamics model. The x-axis shows linearly interpolating the next states and rewards of the learned model with the true model (center right) and a random model (center left), with results on the D4RL W2d-medexp benchmark (min/max over 4 seeds); a sketch of this interpolation is given after the table. Table 1 shows results on the standard offline benchmark D4RL [7] MuJoCo [34] v2 datasets with the true (zero error, zero uncertainty) dynamics. |
| Dataset Splits | No | The paper uses standard D4RL datasets for its experiments, which typically come with predefined splits. However, the paper does not explicitly state the exact train/validation/test splits (e.g., percentages, sample counts, or specific split files) used for its own experimental setup within the main text or appendices. |
| Hardware Specification | Yes | Our algorithm takes on average 6 hours to run using a V100 GPU for the full number of epochs. |
| Software Dependencies | No | The paper mentions using a 'base model-based procedure' and that 'Our implementation is based on the Clean Offline Reinforcement Learning (CORL, Tarasov et al. [33]) repository', but it does not provide specific version numbers for key software components or libraries (e.g., PyTorch, TensorFlow, specific Python libraries) that would enable precise replication of the software environment. |
| Experiment Setup | Yes | For the D4RL [7] MuJoCo results presented in Table 2, we sweep over the following hyperparameters and list the choices used in Table 4: (EDAC) number of Q-ensemble elements Ncritic, in the range {10, 50}; (EDAC) ensemble diversity weight η, in the range {1, 10, 100}; (Base) model rollout length k, in the range {1, 5}; (Base) real-to-synthetic data ratio r, in the range {0.05, 0.5}. The remaining model-based and agent hyperparameters are given in Table 5. Table 5 (fixed hyperparameters for RAVL used in D4RL MuJoCo locomotion tasks): epochs 3,000 for medexp, 1,000 for rest; gamma 0.99; learning rate 3e-4; batch size 256; buffer retain epochs 5; number of rollouts 50,000. (The sweep is written out as an explicit grid in the sketch after the table.) |
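
To make the Algorithm 1 row above concrete, the following is a minimal sketch of the MBPO-style base loop with the RAVL addition (pessimism from a Q-ensemble rather than an uncertainty-penalised reward). All names here (`rollout_model`, `ravl_target`, the toy dynamics, policy, and critics) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Algorithm 1 structure: MBPO-style model rollouts plus
# a RAVL-style pessimistic value target from a Q-ensemble (illustrative only).
import numpy as np

def rollout_model(dynamics, policy, start_states, k):
    """Branch k-step synthetic rollouts from states sampled out of the offline dataset (as in MBPO)."""
    transitions, s = [], start_states
    for _ in range(k):
        a = policy(s)
        s_next, r = dynamics(s, a)            # learned (or true) dynamics model
        transitions.append((s, a, r, s_next))
        s = s_next
    return transitions

def ravl_target(q_ensemble, s_next, a_next, r, gamma=0.99):
    """Pessimistic Bellman target: minimum over an ensemble of critics, so
    state-actions the critics disagree on (e.g. edge-of-reach states) get low values."""
    q_next = np.stack([q(s_next, a_next) for q in q_ensemble])   # (N_critic, batch)
    return r + gamma * q_next.min(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-ins for the learned dynamics, the policy, and the critic ensemble.
    dynamics = lambda s, a: (s + 0.1 * a, -np.linalg.norm(a, axis=-1))
    policy = lambda s: rng.normal(size=s.shape)
    critics = [lambda s, a, w=rng.normal(): (s * a).sum(-1) + w for _ in range(10)]

    start_states = rng.normal(size=(256, 3))        # states drawn from the dataset
    rollouts = rollout_model(dynamics, policy, start_states, k=5)
    s, a, r, s_next = rollouts[-1]
    targets = ravl_target(critics, s_next, policy(s_next), r)
    print(targets.shape)                            # -> (256,)
```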
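
The dynamics-interpolation probe mentioned in the Open Datasets row can be sketched as below: next states and rewards are linearly interpolated between the learned model and the true (or random) dynamics, while everything else is kept the same. The names `learned_step` and `true_step` and the toy usage are assumptions for illustration, not the paper's code.

```python
# Sketch of linearly interpolating a learned dynamics model with the true
# (or a random) model, as on the x-axis of the interpolation experiment.
import numpy as np

def interpolated_step(learned_step, true_step, s, a, alpha):
    """alpha = 0 -> purely learned model; alpha = 1 -> purely true (or random) dynamics."""
    s_learned, r_learned = learned_step(s, a)
    s_true, r_true = true_step(s, a)
    s_next = (1.0 - alpha) * s_learned + alpha * s_true
    r = (1.0 - alpha) * r_learned + alpha * r_true
    return s_next, r

if __name__ == "__main__":
    # Toy dynamics standing in for the learned and true models.
    learned_step = lambda s, a: (s + 1.1 * a, np.zeros_like(s[..., 0]))
    true_step = lambda s, a: (s + 1.0 * a, np.ones_like(s[..., 0]))
    s, a = np.zeros((4, 3)), np.ones((4, 3))
    for alpha in (0.0, 0.5, 1.0):
        s_next, r = interpolated_step(learned_step, true_step, s, a, alpha)
        print(alpha, s_next[0, 0], r[0])
```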
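
Finally, the hyperparameter sweep from the Experiment Setup row written out as an explicit grid. The dictionary keys are descriptive labels chosen here for readability, not the repository's actual config field names.

```python
# Sketch of the RAVL hyperparameter sweep (Table 4 ranges) and fixed settings (Table 5).
from itertools import product

sweep = {
    "num_critics": [10, 50],          # (EDAC) number of Q-ensemble elements
    "eta": [1, 10, 100],              # (EDAC) ensemble diversity weight
    "rollout_length_k": [1, 5],       # (Base) model rollout length
    "real_ratio_r": [0.05, 0.5],      # (Base) real-to-synthetic data ratio
}

fixed = {
    "gamma": 0.99,
    "learning_rate": 3e-4,
    "batch_size": 256,
    "buffer_retain_epochs": 5,
    "num_rollouts": 50_000,
    # epochs: 3,000 for medexp datasets, 1,000 for the rest
}

configs = [dict(zip(sweep, values)) for values in product(*sweep.values())]
print(len(configs))  # 2 * 3 * 2 * 2 = 24 sweep configurations
```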