Pessimistic Minimax Value Iteration: Provably Efficient Equilibrium Learning from Offline Datasets

Authors: Han Zhong, Wei Xiong, Jiyuan Tan, Liwei Wang, Tong Zhang, Zhaoran Wang, Zhuoran Yang

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical We study episodic two-player zero-sum Markov games (MGs) in the offline setting, where the goal is to find an approximate Nash equilibrium (NE) policy pair based on a dataset collected a priori. We propose a pessimism-based algorithm, dubbed as pessimistic minimax value iteration (PMVI)... Furthermore, we establish a data-dependent upper bound on the suboptimality which recovers a sublinear rate without the assumption on uniform coverage of the dataset. We also prove an information-theoretical lower bound, which shows our upper bound is nearly minimax optimal, which suggests that the datadependent term is intrinsic. To the best of our knowledge, we provide the first nearly minimax optimal result for offline MGs with function approximation.
Researcher Affiliation Collaboration 1Center for Data Science, Peking University 2The Hong Kong University of Science and Technology 3Fudan University 4Key Laboratory of Machine Perception, MOE, School of Artificial Intelligence, Peking University 5Google Research 6Northwestern University 7Yale University.
Pseudocode Yes Algorithm 1: Pessimistic Minimax Value Iteration
Open Source Code No The paper does not provide any explicit statement or link regarding the availability of open-source code.
Open Datasets No The paper describes a theoretical "Offline Data Collecting Process" and uses a conceptual "dataset D" for proofs, but does not refer to a specific publicly available dataset used for training actual models.
Dataset Splits No The paper is theoretical and does not conduct experiments with dataset splits for training, validation, or testing.
Hardware Specification No The paper is theoretical and does not describe any experimental setup or the hardware used for computations.
Software Dependencies No The paper is theoretical and does not describe any specific software dependencies with version numbers for implementation.
Experiment Setup No The paper is theoretical and focuses on algorithm design and proofs, thus it does not include details on experimental setup like hyperparameters or training configurations.