Towards Deployment-Efficient Reinforcement Learning: Lower Bound and Optimality

Authors: Jiawei Huang, Jinglin Chen, Li Zhao, Tao Qin, Nan Jiang, Tie-Yan Liu

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper, we propose such a formulation for deployment-efficient RL (DE-RL) from an optimization-with-constraints perspective: we are interested in exploring an MDP and obtaining a near-optimal policy within minimal deployment complexity, whereas in each deployment the policy can sample a large batch of data. Using finite-horizon linear MDPs as a concrete structural model, we reveal the fundamental limit in achieving deployment efficiency by establishing information-theoretic lower bounds, and provide algorithms that achieve the optimal deployment efficiency.
Researcher Affiliation | Collaboration | Jiawei Huang¹, Jinglin Chen¹, Li Zhao², Tao Qin², Nan Jiang¹, Tie-Yan Liu²; ¹ Department of Computer Science, University of Illinois at Urbana-Champaign {jiaweih, jinglinc, nanjiang}@illinois.edu; ² Microsoft Research Asia {lizo, taoqin, tyliu}@microsoft.com
Pseudocode | Yes | Algorithm 1: Layer-by-Layer Batch Exploration Strategy for Linear MDPs Given Reward Function...; Algorithm 2: Deployment-Efficient RL with Covariance Matrix Estimation
Open Source Code | No | The paper does not contain any explicit statements about releasing source code or links to a code repository for the methodology described.
Open Datasets | No | The paper is theoretical and focuses on mathematical formulations, lower bounds, and algorithms for linear MDPs, which are theoretical models. It does not describe experiments using empirical datasets for training.
Dataset Splits | No | The paper is theoretical and does not present empirical experiments that would require dataset splits for training, validation, or testing.
Hardware Specification | No | The paper is theoretical and does not describe any computational experiments that would require specific hardware for execution. Therefore, no hardware specifications are mentioned.
Software Dependencies | No | The paper is theoretical and focuses on algorithm design and proofs. It does not specify any software dependencies with version numbers for implementation or experimentation.
Experiment Setup | No | The paper is theoretical and presents algorithms (Algorithms 1 and 2) with general parameters (e.g., β), but it does not describe a specific experimental setup with concrete hyperparameter values or training configurations for empirical runs.
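To make the DE-RL formulation concrete, the following is a minimal, hypothetical Python sketch of the deployment-efficient interaction protocol the abstract describes: the agent updates its policy only K times (K = deployment complexity), but each deployment collects a large batch of N episodes. The toy finite MDP, one-hot features, and ridge-regularized least-squares value iteration here are illustrative assumptions, not the paper's Algorithm 1 or 2.

```python
import numpy as np

rng = np.random.default_rng(0)
H, S, A = 3, 4, 2          # horizon, states, actions (toy finite MDP)
d = S * A                  # feature dimension: one-hot over (s, a) pairs
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] = distribution over next states
R = rng.random((S, A))                        # reward table

def phi(s, a):
    """One-hot feature for the (s, a) pair (a trivial linear-MDP representation)."""
    v = np.zeros(d)
    v[s * A + a] = 1.0
    return v

def rollout(policy, n_episodes):
    """One deployment: run a fixed policy and collect a batch of episodes."""
    batch = []
    for _ in range(n_episodes):
        s = rng.integers(S)
        for h in range(H):
            a = policy[h][s]
            s2 = rng.choice(S, p=P[s, a])
            batch.append((h, s, a, R[s, a], s2))
            s = s2
    return batch

def lsvi(batch, lam=1.0):
    """Ridge-regularized least-squares value iteration, backwards over h."""
    w = [np.zeros(d) for _ in range(H + 1)]   # w[H] stays zero (terminal value)
    for h in reversed(range(H)):
        Phi, y = [], []
        for (hh, s, a, r, s2) in batch:
            if hh != h:
                continue
            v_next = max(phi(s2, a2) @ w[h + 1] for a2 in range(A))
            Phi.append(phi(s, a))
            y.append(r + v_next)
        Phi, y = np.array(Phi), np.array(y)
        w[h] = np.linalg.solve(Phi.T @ Phi + lam * np.eye(d), Phi.T @ y)
    return w

def greedy(w):
    """Greedy policy w.r.t. the fitted linear Q-functions."""
    return [[int(np.argmax([phi(s, a) @ w[h] for a in range(A)]))
             for s in range(S)] for h in range(H)]

K, N = 3, 200                       # K deployments, each a batch of N episodes
policy = [[0] * S for _ in range(H)]
for k in range(K):                  # deployment complexity is K, not K * N
    batch = rollout(policy, N)
    policy = greedy(lsvi(batch))
```

The point of the sketch is the accounting in the final loop: policy switches (deployments) are the scarce resource, while per-deployment sample size N can be large, which is the constraint the paper's lower bounds and algorithms are stated under.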