Confident Approximate Policy Iteration for Efficient Local Planning in $q^\pi$-realizable MDPs
Authors: Gellért Weisz, András György, Tadashi Kozuno, Csaba Szepesvari
NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We consider approximate dynamic programming in γ-discounted Markov decision processes and apply it to approximate planning with linear value-function approximation. Our first contribution is a new variant of APPROXIMATE POLICY ITERATION (API), called CONFIDENT APPROXIMATE POLICY ITERATION (CAPI), which computes a deterministic stationary policy with an optimal error bound scaling linearly with the product of the effective horizon H and the worstcase approximation error ε of the action-value functions of stationary policies. This improvement over API (whose error scales with H2) comes at the price of an H-fold increase in memory cost. Unlike Scherrer and Lesner [2012], who recommended computing a non-stationary policy to achieve a similar improvement (with the same memory overhead), we are able to stick to stationary policies. This allows for our second contribution, the application of CAPI to planning with local access to a simulator and d-dimensional linear function approximation. As such, we design a planning algorithm that applies CAPI to obtain a sequence of policies with successively refined accuracies on a dynamically evolving set of states. The algorithm outputs an O(dHE)-optimal policy after issuing O(dH4/E2) queries to the simulator, simultaneously achieving the optimal accuracy bound and the best known query complexity bound, while earlier algorithms in the literature achieve only one of them. This query complexity is shown to be tight in all parameters except H. These improvements come at the expense of a mild (polynomial) increase in memory and computational costs of both the algorithm and its output policy. |
| Researcher Affiliation | Collaboration | Gellért Weisz Deep Mind, London, UK University College London, London, UK; András György Deep Mind, London, UK; Tadashi Kozuno University of Alberta, Edmonton, Canada Omron Sinic X, Tokyo, Japan; Csaba Szepesvári Deep Mind, London, UK University of Alberta, Edmonton, Canada |
| Pseudocode | Yes | Algorithm 1 APPROXIMATE POLICY ITERATION (API) and CONFIDENT APPROXIMATE POLICY ITERATION (CAPI); Algorithm 2 MEASURE; Algorithm 3 CAPI-QPI-PLAN |
| Open Source Code | No | The paper states 'N/A' for questions about code, data, and instructions for reproducing experimental results in the 'Checklist' section, indicating no open-source code is provided. |
| Open Datasets | No | The paper is theoretical and does not conduct empirical studies with datasets. Thus, there is no mention of publicly available or open datasets for training. |
| Dataset Splits | No | The paper is theoretical and does not conduct empirical studies with datasets. Thus, there is no mention of training/validation/test splits. |
| Hardware Specification | No | The paper states 'N/A' for questions about hardware specifications in the 'Checklist' section, and no specific hardware used for experiments is mentioned elsewhere. |
| Software Dependencies | No | The paper states 'N/A' for questions about software dependencies in the 'Checklist' section, and no specific software dependencies with version numbers are mentioned elsewhere. |
| Experiment Setup | No | The paper is theoretical and does not conduct experiments. Therefore, it does not provide details on experimental setup such as hyperparameters or system-level training settings. |