A Reduction-Based Framework for Conservative Bandits and Reinforcement Learning

Authors: Yunchang Yang, Tianhao Wu, Han Zhong, Evrard Garcelon, Matteo Pirotta, Alessandro Lazaric, Liwei Wang, Simon Shaolei Du

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | We study bandits and reinforcement learning (RL) subject to a conservative constraint where the agent is asked to perform at least as well as a given baseline policy. This setting is particularly relevant in real-world domains including digital marketing, healthcare, production, finance, etc. In this paper, we present a reduction-based framework for conservative bandits and RL, in which our core technique is to calculate the necessary and sufficient budget obtained from running the baseline policy. For lower bounds, we improve the existing lower bound for conservative multi-armed bandits and obtain new lower bounds for conservative linear bandits, tabular RL and low-rank MDP, through a black-box reduction that turns a certain lower bound in the nonconservative setting into a new lower bound in the conservative setting. For upper bounds, in multi-armed bandits, linear bandits and tabular RL, our new upper bounds tighten or match existing ones with significantly simpler analyses. We also obtain a new upper bound for conservative low-rank MDP.
Researcher Affiliation | Collaboration | Yunchang Yang (Center for Data Science, Peking University, yangyc@pku.edu.cn); Tianhao Wu (University of California, Berkeley, thw@berkeley.edu); Han Zhong (Center for Data Science, Peking University, hanzhong@stu.pku.edu.cn); Evrard Garcelon, Matteo Pirotta, Alessandro Lazaric (Facebook AI Research, {evrard, pirotta, lazaric}@fb.com); Liwei Wang (Key Laboratory of Machine Perception, MOE, School of Artificial Intelligence, Peking University; International Center for Machine Learning Research, Peking University; wanglw@cis.pku.edu.cn); Simon S. Du (University of Washington, ssdu@cs.washington.edu)
Pseudocode | Yes | Algorithm 1: Budget-Exploration; Algorithm 2: Lower Confidence Bound for Conservative Exploration (an illustrative sketch of this kind of conservative check appears after the table)
Open Source Code | No | The paper does not provide any links to open-source code or state that code is made available.
Open Datasets | No | This paper is theoretical, focusing on mathematical bounds and algorithms, and does not conduct experiments on datasets. Therefore, it does not refer to publicly available datasets with access information.
Dataset Splits | No | This paper is theoretical, focusing on mathematical bounds and algorithms, and does not conduct experiments on datasets. Therefore, it does not specify training/test/validation dataset splits.
Hardware Specification | No | The paper is theoretical and does not describe any experimental hardware used.
Software Dependencies | No | The paper is theoretical and does not list specific software dependencies with version numbers for experimental reproducibility.
Experiment Setup | No | The paper is theoretical and does not describe specific experimental setup details like hyperparameters or training configurations.
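The conservative constraint described in the abstract requires the agent's cumulative reward to stay above a (1 - alpha) fraction of the baseline's cumulative reward, and conservative exploration algorithms enforce this with a budget check before each exploratory action. The sketch below illustrates how such an LCB-style check can look in the multi-armed bandit case; it is a minimal illustration, not the paper's exact Algorithm 2. The Bernoulli reward model, the Hoeffding-style confidence radius, and the assumption that the baseline arm's mean is known are all illustrative choices.

```python
# Hypothetical sketch of a conservative multi-armed bandit loop.
# Before playing the optimistic (UCB) arm, check that a pessimistic estimate of
# the agent's cumulative reward stays above (1 - alpha) times the baseline's
# cumulative reward; otherwise fall back to the baseline arm.
import math
import random


def conservative_bandit(means, baseline_arm, alpha=0.1, horizon=5000, seed=0):
    rng = random.Random(seed)
    n_arms = len(means)
    counts = [0] * n_arms      # number of pulls per arm
    sums = [0.0] * n_arms      # cumulative observed reward per arm
    total_reward = 0.0         # realized cumulative reward of the agent

    def radius(t, n):
        # Hoeffding-style confidence radius (illustrative constants).
        return math.sqrt(2.0 * math.log(max(t, 2)) / n) if n > 0 else float("inf")

    for t in range(1, horizon + 1):
        ucbs = [(sums[a] / counts[a] + radius(t, counts[a])) if counts[a] else float("inf")
                for a in range(n_arms)]
        lcbs = [(sums[a] / counts[a] - radius(t, counts[a])) if counts[a] else 0.0
                for a in range(n_arms)]
        candidate = max(range(n_arms), key=lambda a: ucbs[a])

        # Conservative budget check: pessimistic cumulative reward after playing
        # the candidate must stay above (1 - alpha) * baseline cumulative reward.
        # The baseline's mean reward is assumed known here for simplicity.
        baseline_cum = (1 - alpha) * means[baseline_arm] * t
        pessimistic_cum = total_reward + lcbs[candidate]
        arm = candidate if pessimistic_cum >= baseline_cum else baseline_arm

        reward = 1.0 if rng.random() < means[arm] else 0.0  # Bernoulli reward draw
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward

    return total_reward


if __name__ == "__main__":
    # Baseline is arm 0 (mean 0.5); arm 2 (mean 0.7) is optimal.
    print(conservative_bandit([0.5, 0.4, 0.7], baseline_arm=0))
```

Early rounds mostly play the baseline because the budget slack is small; as the slack (roughly alpha times the baseline's cumulative reward) grows, the check permits exploration, which mirrors the "necessary and sufficient budget from running the baseline policy" idea the abstract describes.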