Provably Efficient CVaR RL in Low-rank MDPs

Authors: Yulai Zhao, Wenhao Zhan, Xiaoyan Hu, Ho-fung Leung, Farzan Farnia, Wen Sun, Jason D. Lee

ICLR 2024

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Theoretical | We prove that our algorithm achieves a sample complexity of $\tilde{O}\left(\frac{H^7 A^2 d^4}{\tau^2 \epsilon^2}\right)$ to yield an $\epsilon$-optimal CVaR, where $H$ is the length of each episode, $A$ is the capacity of action space, and $d$ is the dimension of representations. Computational-wise, we design a novel discretized Least-Squares Value Iteration (LSVI) algorithm for the CVaR objective as the planning oracle and show that we can find the near-optimal policy in a polynomial running time with a Maximum Likelihood Estimation oracle. To our knowledge, this is the first provably efficient CVaR RL algorithm in low-rank MDPs. |
| Researcher Affiliation | Academia | Yulai Zhao (Princeton University, yulaiz@princeton.edu); Wenhao Zhan (Princeton University, wenhao.zhan@princeton.edu); Xiaoyan Hu (The Chinese University of Hong Kong, xyhu21@cse.cuhk.edu.hk); Ho-fung Leung (Independent Researcher, ho-fung.leung@outlook.com); Farzan Farnia (The Chinese University of Hong Kong, farnia@cse.cuhk.edu.hk); Wen Sun (Cornell University, ws455@cornell.edu); Jason D. Lee (Princeton University, jasonlee@princeton.edu) |
| Pseudocode | Yes | Algorithm 1 ELA and Algorithm 3 ELLA are provided as structured pseudocode blocks. |
| Open Source Code | No | The paper is theoretical and does not mention releasing open-source code for the described methodology. |
| Open Datasets | No | The paper is theoretical and does not specify the use of any publicly available datasets for training or evaluation. |
| Dataset Splits | No | The paper is theoretical and does not specify training, validation, or test dataset splits. |
| Hardware Specification | No | The paper is theoretical and does not describe specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not provide specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe a concrete experimental setup with hyperparameter values or training configurations. |
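
For readers unfamiliar with the CVaR objective named in the Research Type entry, the following is a minimal illustrative sketch and not code from the paper, which releases none. It assumes the standard variational form for rewards, CVaR_tau(X) = sup_b { b - E[(b - X)^+] / tau }, with risk tolerance tau in (0, 1], and evaluates it on a finite sample. The function name `empirical_cvar` and its parameters are hypothetical.

```python
def empirical_cvar(returns, tau):
    """Empirical lower-tail CVaR of sampled returns at risk tolerance tau.

    Assumed form (not taken from the paper's pseudocode):
        CVaR_tau(X) = sup_b { b - E[(b - X)^+] / tau },  0 < tau <= 1.
    """
    if not 0.0 < tau <= 1.0:
        raise ValueError("tau must lie in (0, 1]")
    xs = [float(x) for x in returns]
    if not xs:
        raise ValueError("returns must be non-empty")
    n = len(xs)

    def objective(b):
        # Average shortfall of the samples below the threshold b.
        shortfall = sum(max(b - x, 0.0) for x in xs) / n
        return b - shortfall / tau

    # The objective is concave and piecewise linear in b, with kinks only at
    # the sample values, so the supremum is attained at one of the samples.
    return max(objective(b) for b in set(xs))
```

With `tau = 1` this reduces to the sample mean, and with smaller `tau` it averages only the worst outcomes; for example, `empirical_cvar([0.0, 1.0, 2.0, 3.0], 0.5)` evaluates to 0.5, the mean of the two lowest returns.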