Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage

Authors: Masatoshi Uehara, Nathan Kallus, Jason D. Lee, Wen Sun

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this work, we propose value-based algorithms for offline RL with PAC guarantees under just partial coverage, specifically, coverage of just a single comparator policy, and realizability of the soft (entropy-regularized) Q-function of the single policy and a related function defined as a saddle point of a certain minimax optimization problem. This offers refined and generally more lax conditions for offline RL. We further show an analogous result for vanilla Q-functions under a soft margin condition. To attain these guarantees, we leverage novel minimax learning algorithms and analyses to accurately estimate either soft or vanilla Q-functions with strong L2-convergence guarantees. Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
Researcher Affiliation | Collaboration | Masatoshi Uehara (Genentech, uehara.masatoshi@gene.com); Nathan Kallus (Cornell University, kallus@cornell.edu); Jason D. Lee (Princeton University, jasonlee@princeton.edu); Wen Sun (Cornell University, ws455@cornell.edu)
Pseudocode | Yes | Algorithm 1, MSQP (Minimax Soft-Q-learning with Penalization), and Algorithm 2, MQP (Minimax Q-learning with Penalization), are provided.
Open Source Code | No | The paper does not provide any statements or links indicating the availability of open-source code for the described methodology.
Open Datasets | No | The paper is theoretical and does not mention using any specific publicly available datasets for training. It refers to generic "offline data D = {(s_i, a_i, r_i, s'_i) : i = 1, ..., n}".
Dataset Splits | No | The paper is theoretical and does not describe any dataset splits (training, validation, test) for experimental reproduction.
Hardware Specification | No | The paper is theoretical and does not mention any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and does not specify any software names with version numbers that would be required to reproduce the work.
Experiment Setup | No | The paper is theoretical and does not describe a concrete experimental setup with specific hyperparameters or training configurations.
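
Since the summary above refers to a soft (entropy-regularized) Q-function without showing what one looks like, the following is a minimal illustrative sketch of the textbook soft Bellman fixed point on a toy tabular MDP with a known model. It is not the paper's MSQP or MQP algorithm, which estimate Q-functions from offline data via minimax losses, and the paper's exact regularizer may differ. All names here (`soft_value`, `soft_q_iteration`, `soft_policy`, the toy `P` and `R`, and the values of `gamma` and `alpha`) are hypothetical and chosen only for illustration.

```python
import math

def soft_value(q_row, alpha):
    """Soft maximum alpha * log(sum_a exp(q_a / alpha)), computed stably."""
    m = max(q_row)
    return m + alpha * math.log(sum(math.exp((q - m) / alpha) for q in q_row))

def soft_q_iteration(P, R, gamma=0.9, alpha=1.0, iters=500):
    """Iterate the soft Bellman backup on a tabular MDP (assumed known model).

    P[s][a][t] is the probability of moving from state s to state t under
    action a; R[s][a] is the reward. Both are illustrative inputs.
    """
    n_s, n_a = len(R), len(R[0])
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        # Soft state values replace the hard max of standard Q-iteration.
        V = [soft_value(Q[s], alpha) for s in range(n_s)]
        Q = [[R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
    return Q

def soft_policy(q_row, alpha):
    """Softmax policy induced by one row of a soft Q-function."""
    m = max(q_row)
    w = [math.exp((q - m) / alpha) for q in q_row]
    z = sum(w)
    return [x / z for x in w]

# Toy two-state, two-action MDP: action a deterministically leads to state a.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[1.0, 0.0], [0.0, 1.0]]]
R = [[0.0, 1.0],
     [0.0, 2.0]]
Q = soft_q_iteration(P, R)
pi = [soft_policy(row, 1.0) for row in Q]
```

As `alpha` shrinks toward zero, the soft maximum approaches the ordinary maximum, so the iteration approaches standard (vanilla) Q-iteration; larger `alpha` spreads the induced policy more evenly across actions.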