A Reduction-Based Framework for Conservative Bandits and Reinforcement Learning
Authors: Yunchang Yang, Tianhao Wu, Han Zhong, Evrard Garcelon, Matteo Pirotta, Alessandro Lazaric, Liwei Wang, Simon Shaolei Du
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We study bandits and reinforcement learning (RL) subject to a conservative constraint where the agent is asked to perform at least as well as a given baseline policy. This setting is particularly relevant in real-world domains including digital marketing, healthcare, production, finance, etc. In this paper, we present a reduction-based framework for conservative bandits and RL, in which our core technique is to calculate the necessary and sufficient budget obtained from running the baseline policy. For lower bounds, we improve the existing lower bound for conservative multi-armed bandits and obtain new lower bounds for conservative linear bandits, tabular RL and low-rank MDP, through a black-box reduction that turns a certain lower bound in the nonconservative setting into a new lower bound in the conservative setting. For upper bounds, in multi-armed bandits, linear bandits and tabular RL, our new upper bounds tighten or match existing ones with significantly simpler analyses. We also obtain a new upper bound for conservative low-rank MDP. |
| Researcher Affiliation | Collaboration | Yunchang Yang, Center for Data Science, Peking University, yangyc@pku.edu.cn; Tianhao Wu, University of California, Berkeley, thw@berkeley.edu; Han Zhong, Center for Data Science, Peking University, hanzhong@stu.pku.edu.cn; Evrard Garcelon, Matteo Pirotta, Alessandro Lazaric, Facebook AI Research, {evrard, pirotta, lazaric}@fb.com; Liwei Wang, Key Laboratory of Machine Perception, MOE, School of Artificial Intelligence, Peking University, and International Center for Machine Learning Research, Peking University, wanglw@cis.pku.edu.cn; Simon S. Du, University of Washington, ssdu@cs.washington.edu |
| Pseudocode | Yes | Algorithm 1: Budget-Exploration; Algorithm 2: Lower Confidence Bound for Conservative Exploration (a hedged sketch of the conservative-exploration check appears after this table) |
| Open Source Code | No | The paper does not provide any links to open-source code or state that code is made available. |
| Open Datasets | No | This paper is theoretical, focusing on mathematical bounds and algorithms, and does not conduct experiments on datasets. Therefore, it does not refer to publicly available datasets with access information. |
| Dataset Splits | No | This paper is theoretical, focusing on mathematical bounds and algorithms, and does not conduct experiments on datasets. Therefore, it does not specify training/test/validation dataset splits. |
| Hardware Specification | No | The paper is theoretical and does not describe any experimental hardware used. |
| Software Dependencies | No | The paper is theoretical and does not list specific software dependencies with version numbers for experimental reproducibility. |
| Experiment Setup | No | The paper is theoretical and does not describe specific experimental setup details like hyperparameters or training configurations. |
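
To make the conservative-exploration idea referenced in the Pseudocode row concrete, here is a minimal sketch in the spirit of a lower-confidence-bound conservative bandit: each round a UCB candidate arm is proposed, but it is only played if a pessimistic estimate of the cumulative reward stays above a `(1 - alpha)` fraction of the baseline's cumulative reward; otherwise the baseline arm is played to build budget. This is an illustrative reconstruction under assumed settings (Bernoulli rewards, a known baseline arm mean, ad hoc confidence widths), not the paper's Algorithm 1 or 2.

```python
import numpy as np

def conservative_ucb(means, baseline_arm, alpha=0.1, horizon=5000, seed=0):
    """Illustrative conservative-exploration loop for a Bernoulli bandit.

    Assumptions (not from the paper): rewards are Bernoulli(means[i]), the
    baseline arm's mean is known, and Hoeffding-style confidence widths are
    used. The conservative check follows the generic LCB pattern: only play
    the candidate if the pessimistic cumulative reward stays above
    (1 - alpha) times what the baseline would have earned so far.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k)        # number of pulls per arm
    sums = np.zeros(k)          # cumulative reward per arm
    mu0 = means[baseline_arm]   # baseline mean, assumed known
    mask = np.arange(k) != baseline_arm  # non-baseline arms

    for t in range(1, horizon + 1):
        width = np.sqrt(2.0 * np.log(max(t, 2)) / np.maximum(counts, 1))
        mean_hat = sums / np.maximum(counts, 1)
        ucb = np.where(counts > 0, mean_hat + width, np.inf)
        lcb = np.where(counts > 0, mean_hat - width, 0.0)
        candidate = int(np.argmax(ucb))

        # Conservative (budget) check: pessimistic cumulative reward after
        # playing the candidate must not fall below (1 - alpha) * t * mu0.
        pessimistic = (np.dot(counts[mask], lcb[mask])
                       + counts[baseline_arm] * mu0
                       + lcb[candidate])
        arm = candidate if pessimistic >= (1 - alpha) * t * mu0 else baseline_arm

        reward = rng.binomial(1, means[arm])
        counts[arm] += 1
        sums[arm] += reward

    return sums.sum(), counts[baseline_arm]

# Example usage: three arms, arm 0 is the baseline policy.
total_reward, baseline_pulls = conservative_ucb([0.5, 0.6, 0.4], baseline_arm=0)
print(f"total reward: {total_reward}, baseline pulls: {baseline_pulls}")
```

Early on, the pessimistic estimate is too small to certify the constraint, so the baseline arm is played to accumulate budget; once enough budget exists, exploration of the UCB candidate takes over, which mirrors the budget-based reduction described in the abstract.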