A Reduction-Based Framework for Conservative Bandits and Reinforcement Learning

Authors: Yunchang Yang, Tianhao Wu, Han Zhong, Evrard Garcelon, Matteo Pirotta, Alessandro Lazaric, Liwei Wang, Simon Shaolei Du

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We study bandits and reinforcement learning (RL) subject to a conservative constraint where the agent is asked to perform at least as well as a given baseline policy. This setting is particularly relevant in real-world domains including digital marketing, healthcare, production, finance, etc. In this paper, we present a reduction-based framework for conservative bandits and RL, in which our core technique is to calculate the necessary and sufficient budget obtained from running the baseline policy. For lower bounds, we improve the existing lower bound for conservative multi-armed bandits and obtain new lower bounds for conservative linear bandits, tabular RL and low-rank MDP, through a black-box reduction that turns a certain lower bound in the nonconservative setting into a new lower bound in the conservative setting. For upper bounds, in multi-armed bandits, linear bandits and tabular RL, our new upper bounds tighten or match existing ones with significantly simpler analyses. We also obtain a new upper bound for conservative low-rank MDP.
Researcher Affiliation | Collaboration | Yunchang Yang (Center for Data Science, Peking University, yangyc@pku.edu.cn); Tianhao Wu (University of California, Berkeley, thw@berkeley.edu); Han Zhong (Center for Data Science, Peking University, hanzhong@stu.pku.edu.cn); Evrard Garcelon, Matteo Pirotta, Alessandro Lazaric (Facebook AI Research, {evrard, pirotta, lazaric}@fb.com); Liwei Wang (Key Laboratory of Machine Perception, MOE, School of Artificial Intelligence, Peking University; International Center for Machine Learning Research, Peking University; wanglw@cis.pku.edu.cn); Simon S. Du (University of Washington, ssdu@cs.washington.edu)
Pseudocode | Yes | Algorithm 1: Budget-Exploration; Algorithm 2: Lower Confidence Bound for Conservative Exploration (an illustrative sketch of this kind of conservative check appears after the table)
Open Source Code | No | The paper does not provide any links to open-source code or state that code is made available.
Open Datasets | No | This paper is theoretical, focusing on mathematical bounds and algorithms, and does not conduct experiments on datasets. Therefore, it does not refer to publicly available datasets with access information.
Dataset Splits | No | This paper is theoretical, focusing on mathematical bounds and algorithms, and does not conduct experiments on datasets. Therefore, it does not specify training/test/validation dataset splits.
Hardware Specification | No | The paper is theoretical and does not describe any experimental hardware used.
Software Dependencies | No | The paper is theoretical and does not list specific software dependencies with version numbers for experimental reproducibility.
Experiment Setup | No | The paper is theoretical and does not describe specific experimental setup details like hyperparameters or training configurations.
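The conservative constraint described in the abstract requires the agent's cumulative reward to stay above a (1 - alpha) fraction of the baseline's cumulative reward, and conservative exploration algorithms enforce this with a budget check before each exploratory action. The sketch below illustrates how such an LCB-style check can look in the multi-armed bandit case; it is a minimal illustration, not the paper's exact Algorithm 2. The Bernoulli reward model, the Hoeffding-style confidence radius, and the assumption that the baseline arm's mean is known are all illustrative choices.

```python
# Hypothetical sketch of a conservative multi-armed bandit loop.
# Before playing the optimistic (UCB) arm, check that a pessimistic estimate of
# the agent's cumulative reward stays above (1 - alpha) times the baseline's
# cumulative reward; otherwise fall back to the baseline arm.
import math
import random


def conservative_bandit(means, baseline_arm, alpha=0.1, horizon=5000, seed=0):
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms      # number of pulls per arm
    sums = [0.0] * n_arms      # cumulative observed reward per arm
    total_reward = 0.0         # realized cumulative reward of the agent

    def radius(t, n):
        # Hoeffding-style confidence radius (illustrative constants).
        return math.sqrt(2.0 * math.log(max(t, 2)) / n) if n > 0 else float("inf")

    for t in range(1, horizon + 1):
        ucbs = [(sums[a] / counts[a] + radius(t, counts[a])) if counts[a] else float("inf")
                for a in range(n_arms)]
        lcbs = [(sums[a] / counts[a] - radius(t, counts[a])) if counts[a] else 0.0
                for a in range(n_arms)]
        candidate = max(range(n_arms), key=lambda a: ucbs[a])

        # Conservative budget check: pessimistic cumulative reward after playing
        # the candidate must stay above (1 - alpha) * baseline cumulative reward.
        # The baseline's mean reward is assumed known here for simplicity.
        baseline_cum = (1 - alpha) * means[baseline_arm] * t
        pessimistic_cum = total_reward + lcbs[candidate]
        arm = candidate if pessimistic_cum >= baseline_cum else baseline_arm

        reward = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli reward draw
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward

    return total_reward


if __name__ == "__main__":
    # Baseline is arm 0 (mean 0.5); arm 2 (mean 0.7) is optimal.
    print(conservative_bandit([0.5, 0.4, 0.7], baseline_arm=0))
```

Early rounds mostly play the baseline because the budget slack is small; as the slack (roughly alpha times the baseline's cumulative reward) grows, the check permits exploration, which mirrors the "necessary and sufficient budget from running the baseline policy" idea the abstract describes.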