Offline RL with Observation Histories: Analyzing and Improving Sample Complexity
Authors: Joey Hong, Anca Dragan, Sergey Levine
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental evaluation aims to empirically analyze the relationship between the performance of offline RL in partially observed settings and the bisimulation loss we discussed in Section 6. Our hypothesis is that, if naïve offline RL performs poorly on a given POMDP, then adding the bisimulation loss should improve performance, and if offline RL already does well, then the learned representations should already induce a bisimulation metric, and thus a low value of this loss. Note that our theory does not state that naïve offline RL will always perform poorly, just that it has a poor worst-case bound, so we would not expect an explicit bisimulation loss to always be necessary, though we hypothesize that successful offline RL runs might still minimize this loss as a byproduct of successful learning when they work well. We describe the main elements of each evaluation in the main paper, and defer implementation details to Appendix B. |
| Researcher Affiliation | Academia | Joey Hong, Anca Dragan, Sergey Levine, UC Berkeley, {joey_hong,anca,sergey.levine}@berkeley.edu |
| Pseudocode | Yes | Algorithm 1 Offline RL with Bisimulation Learning |
| Open Source Code | No | The paper does not explicitly state that source code is open-sourced or provide a link to a code repository for the described methodology. |
| Open Datasets | Yes | We use a dataset of Wordle games played by real humans and scraped from tweets, which was originally compiled and processed by Snell et al. (2023). |
| Dataset Splits | No | The paper mentions dataset creation and sizes but does not specify training, validation, or test splits with percentages or counts. |
| Hardware Specification | Yes | All algorithms were trained on a single V100 GPU until convergence, which took less than 3 days. |
| Software Dependencies | No | The paper mentions GPT-2 and AdamW but does not provide specific version numbers for any software dependencies or libraries. |
| Experiment Setup | Yes | We use the hyperparameters reported in Table 3. |
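The evidence in the Research Type and Pseudocode rows centers on Algorithm 1, which augments standard offline RL with a bisimulation loss over learned observation-history representations. As a rough illustration of that idea only (not the authors' implementation, which is described in their Appendix B), the minimal sketch below uses a common sample-based proxy for the bisimulation target: the distance between two history representations is regressed toward the reward difference plus the discounted distance between next-step representations. The encoder architecture, dimensions, and all argument names are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical history encoder phi: maps a flattened observation
# history to a representation z. Sizes are illustrative only.
phi = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))

def bisimulation_loss(hist_a, hist_b, r_a, r_b,
                      next_hist_a, next_hist_b, gamma=0.99):
    """Penalize mismatch between representation distance and a
    bisimulation-style target: |r - r'| + gamma * d(next histories).
    Next-step representations are treated as fixed targets
    (stop-gradient), a common choice for such bootstrapped losses."""
    d = torch.norm(phi(hist_a) - phi(hist_b), dim=-1)
    with torch.no_grad():
        d_next = torch.norm(phi(next_hist_a) - phi(next_hist_b), dim=-1)
    target = (r_a - r_b).abs() + gamma * d_next
    return F.mse_loss(d, target)

# Toy usage on random pairs of histories sampled from a batch.
batch = lambda: torch.randn(8, 32)
loss = bisimulation_loss(batch(), batch(),
                         torch.randn(8), torch.randn(8),
                         batch(), batch())
loss.backward()
```

In a full training loop, this term would presumably be weighted and added to the usual offline RL objective (for example, `total = td_loss + beta * bisim_loss`), so that histories with similar reward and transition structure are encoded close together; the exact pairing scheme, weighting, and distance measure in Algorithm 1 would follow the paper rather than this sketch.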