Bellman Residual Orthogonalization for Offline Reinforcement Learning
Authors: Andrea Zanette, Martin J Wainwright
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We propose and analyze a reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a user-defined space of test functions. Focusing on applications to model-free offline RL with function approximation, we exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class. We prove an oracle inequality on our policy optimization procedure in terms of a trade-off between the value and uncertainty of an arbitrary comparator policy. ... We examine in depth the implementation of our methods with linear function approximation, and provide theoretical guarantees with polynomial-time implementations even when Bellman closure does not hold. Also, the authors' checklist under 'If you ran experiments...' has '[N/A]' for points 3a-3d. |
| Researcher Affiliation | Academia | Andrea Zanette Department of Computer Sciences and Electrical Engineering University of California, Berkeley zanette@berkeley.edu Martin J. Wainwright Department of Electrical Engineering and Computer Sciences, Department of Mathematics, Massachusetts Institute of Technology, and Department of Computer Sciences and Electrical Engineering, Department of Statistics, University of California, Berkeley wainwrigwork@gmail.com |
| Pseudocode | No | The paper describes the steps for its 'actor-critic method' in Section 5.2, listing them as numbered points (e.g., 'At each iteration t = 1, . . . , T, ...', 'Using the finite test function class (18)...', 'Using the action-value vector wt...'), but this is presented as descriptive text and not in a structured 'Algorithm' block or 'Pseudocode' format. |
| Open Source Code | No | In the 'Ethics Review' section, point 3a ('Did you include the code, data, and instructions needed to reproduce the main experimental results...'), listed under 'If you ran experiments...', and point 4c ('Did you include any new assets either in the supplemental material or as a URL?') are both marked '[N/A]'. The paper contains no explicit statement or link providing open-source code for the methodology. |
| Open Datasets | No | The paper introduces 'Assumption 1 (I.i.d. dataset)' to describe the data generation mechanism: 'An i.i.d. dataset is a collection $\mathcal{D} = \{(s_i, a_i, r_i, s_i^+)\}_{i=1}^n$ such that for each $i = 1, \ldots, n$ we have $(s_i, a_i, o_i) \sim \mu$ and conditioned on $(s_i, a_i, o_i)$, we observe a noisy reward $r_i = r(s_i, a_i) + \eta_i$ with $\mathbb{E}[\eta_i \mid \mathcal{F}_i] = 0$, $\lvert r_i \rvert \le 1$ and the next state $s_i^+ \sim \mathbb{P}(s_i, a_i)$.' However, it does not specify a named public dataset or provide access information (link, citation) for a publicly available or open dataset. |
| Dataset Splits | No | The paper is theoretical and does not describe experimental setups with data splits. The 'Ethics Review' section explicitly states '[N/A]' for questions regarding running experiments and training details (3b). |
| Hardware Specification | No | The paper does not provide any specific hardware details. In the 'Ethics Review' section, under 'If you ran experiments...', point 3d ('Did you include the total amount of compute and the type of resources used...') is marked '[N/A]', indicating no experiments with specific hardware were conducted or reported. |
| Software Dependencies | No | The paper is theoretical and does not describe any specific software dependencies with version numbers. In the 'Ethics Review' section, under 'If you ran experiments...', points 3a and 3b are marked '[N/A]', which cover code and training details. |
| Experiment Setup | No | The paper is theoretical and does not provide details about an experimental setup, hyperparameters, or system-level training settings. In the 'Ethics Review' section, under 'If you ran experiments...', point 3b ('Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)?') is marked '[N/A]'. |
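
The principle quoted under Research Type, enforcing the Bellman equations only along a space of test functions, can be illustrated with a minimal sketch. It is not the paper's procedure: it assumes a linear action-value function $Q_w(s, a) = \phi(s, a)^\top w$ and takes the test functions to be the feature coordinates themselves, in which case the orthogonality conditions reduce to a standard LSTD-style linear system. The names `weak_bellman_solve`, `weighted_residuals`, `phi`, `phi_next`, and `gamma` are hypothetical, and the confidence intervals and policy optimization analyzed in the paper are not reproduced here.

```python
import numpy as np


def weak_bellman_solve(phi, rewards, phi_next, gamma):
    """Find w whose Bellman residual is orthogonal to every test function.

    phi:      (n, d) features of the sampled pairs (s_i, a_i)
    rewards:  (n,)   observed rewards r_i
    phi_next: (n, d) features of (s_i^+, a') with a' chosen by the target
              policy (assumed precomputed by the caller)
    gamma:    discount factor in [0, 1)

    Assumption: the test functions are the d feature coordinates.
    """
    n = phi.shape[0]
    # Residual on sample i: phi_i^T w - r_i - gamma * phi_next_i^T w.
    # Requiring (1/n) * sum_i phi_i * residual_i = 0 gives A w = b.
    A = phi.T @ (phi - gamma * phi_next) / n
    b = phi.T @ rewards / n
    return np.linalg.solve(A, b)


def weighted_residuals(w, test_values, phi, rewards, phi_next, gamma):
    """Empirical Bellman residual of Q_w averaged against each test function.

    test_values: (n, m) array whose column j holds test function j
                 evaluated at the sampled pairs (s_i, a_i).
    """
    residual = phi @ w - rewards - gamma * (phi_next @ w)
    return test_values.T @ residual / len(rewards)
```

Passing a different `test_values` matrix to `weighted_residuals` shows how far a candidate `w` is from satisfying the Bellman equations along another, user-chosen test space.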