Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage
Authors: Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | In this work, we propose value-based algorithms for offline RL with PAC guarantees under just partial coverage, specifically, coverage of just a single comparator policy, and realizability of the soft (entropy-regularized) Q-function of the single policy and a related function defined as a saddle point of a certain minimax optimization problem. This offers refined and generally more lax conditions for offline RL. We further show an analogous result for vanilla Q-functions under a soft margin condition. To attain these guarantees, we leverage novel minimax learning algorithms and analyses to accurately estimate either soft or vanilla Q-functions with strong L2-convergence guarantees. Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying. |
| Researcher Affiliation | Collaboration | Masatoshi Uehara (Genentech, uehara.masatoshi@gene.com); Nathan Kallus (Cornell University, kallus@cornell.edu); Jason D. Lee (Princeton University, jasonlee@princeton.edu); Wen Sun (Cornell University, ws455@cornell.edu) |
| Pseudocode | Yes | Algorithm 1, MSQP (Minimax Soft-Q-learning with Penalization), and Algorithm 2, MQP (Minimax Q-learning with Penalization), are provided. |
| Open Source Code | No | The paper does not provide any statements or links indicating the availability of open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and does not mention using any specific publicly available datasets for training. It refers to generic "offline data $\mathcal{D} = \{(s_i, a_i, r_i, s'_i) : i = 1, \dots, n\}$". |
| Dataset Splits | No | The paper is theoretical and does not describe any dataset splits (training, validation, test) for experimental reproduction. |
| Hardware Specification | No | The paper is theoretical and does not mention any specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not specify any software names with version numbers that would be required to reproduce the work. |
| Experiment Setup | No | The paper is theoretical and does not describe a concrete experimental setup with specific hyperparameters or training configurations. |