Towards Deployment-Efficient Reinforcement Learning: Lower Bound and Optimality

Authors: Jiawei Huang, Jinglin Chen, Li Zhao, Tao Qin, Nan Jiang, Tie-Yan Liu

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we propose such a formulation for deployment-efficient RL (DE-RL) from an optimization-with-constraints perspective: we are interested in exploring an MDP and obtaining a near-optimal policy within minimal deployment complexity, whereas in each deployment the policy can sample a large batch of data. Using finite-horizon linear MDPs as a concrete structural model, we reveal the fundamental limit in achieving deployment efficiency by establishing information-theoretic lower bounds, and provide algorithms that achieve the optimal deployment efficiency.
Researcher Affiliation | Collaboration | Jiawei Huang¹, Jinglin Chen¹, Li Zhao², Tao Qin², Nan Jiang¹, Tie-Yan Liu²; ¹ Department of Computer Science, University of Illinois at Urbana-Champaign {jiaweih, jinglinc, nanjiang}@illinois.edu; ² Microsoft Research Asia {lizo, taoqin, tyliu}@microsoft.com
Pseudocode | Yes | Algorithm 1: Layer-by-Layer Batch Exploration Strategy for Linear MDPs Given Reward Function...; Algorithm 2: Deployment-Efficient RL with Covariance Matrix Estimation
Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to a code repository for the methodology described.
Open Datasets | No | The paper is theoretical and focuses on mathematical formulations, lower bounds, and algorithms for linear MDPs, which are theoretical models. It does not describe experiments using empirical datasets for training.
Dataset Splits | No | The paper is theoretical and does not present empirical experiments that would require dataset splits for training, validation, or testing.
Hardware Specification | No | The paper is theoretical and does not describe any computational experiments that would require specific hardware for execution. Therefore, no hardware specifications are mentioned.
Software Dependencies | No | The paper is theoretical and focuses on algorithm design and proofs. It does not specify any software dependencies with version numbers for implementation or experimentation.
Experiment Setup | No | The paper is theoretical and presents algorithms (Algorithms 1 and 2) with general parameters (e.g., β), but it does not describe a specific experimental setup with concrete hyperparameter values or training configurations for empirical runs.
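To make the DE-RL formulation concrete, the following is a minimal, hypothetical Python sketch of the deployment-efficient interaction protocol the abstract describes: the agent updates its policy only K times (K = deployment complexity), but each deployment collects a large batch of N episodes. The toy finite MDP, one-hot features, and ridge-regularized least-squares value iteration here are illustrative assumptions, not the paper's Algorithm 1 or 2.

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A = 3, 4, 2          # horizon, states, actions (toy finite MDP)
d = S * A                  # feature dimension: one-hot over (s, a) pairs
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
R = rng.random((S, A))                        # reward table

def phi(s, a):
    """One-hot feature for the (s, a) pair (a trivial linear-MDP representation)."""
    v = np.zeros(d)
    v[s * A + a] = 1.0
    return v

def rollout(policy, n_episodes):
    """One deployment: run a fixed policy and collect a batch of episodes."""
    batch = []
    for _ in range(n_episodes):
        s = rng.integers(S)
        for h in range(H):
            a = policy[h][s]
            s2 = rng.choice(S, p=P[s, a])
            batch.append((h, s, a, R[s, a], s2))
            s = s2
    return batch

def lsvi(batch, lam=1.0):
    """Ridge-regularized least-squares value iteration, backwards over h."""
    w = [np.zeros(d) for _ in range(H + 1)]   # w[H] stays zero (terminal value)
    for h in reversed(range(H)):
        Phi, y = [], []
        for (hh, s, a, r, s2) in batch:
            if hh != h:
                continue
            v_next = max(phi(s2, a2) @ w[h + 1] for a2 in range(A))
            Phi.append(phi(s, a))
            y.append(r + v_next)
        Phi, y = np.array(Phi), np.array(y)
        w[h] = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
    return w

def greedy(w):
    """Greedy policy w.r.t. the fitted linear Q-functions."""
    return [[int(np.argmax([phi(s, a) @ w[h] for a in range(A)]))
             for s in range(S)] for h in range(H)]

K, N = 3, 200                       # K deployments, each a batch of N episodes
policy = [[0] * S for _ in range(H)]
for k in range(K):                  # deployment complexity is K, not K * N
    batch = rollout(policy, N)
    policy = greedy(lsvi(batch))
```

The point of the sketch is the accounting in the final loop: policy switches (deployments) are the scarce resource, while per-deployment sample size N can be large, which is the constraint the paper's lower bounds and algorithms are stated under.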