Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation
Authors: Dan Qiao, Yu-Xiang Wang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the reward-free exploration setting. This is a well-motivated problem because deploying new policies is costly in real-life RL applications. Under the linear MDP setting with feature dimension $d$ and planning horizon $H$, we propose a new algorithm that collects at most $\widetilde{O}(d^2H^5/\epsilon^2)$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions. To the best of our knowledge, our approach is the first to achieve optimal deployment complexity and optimal $d$ dependence in sample complexity at the same time, even if the reward is known ahead of time. Our novel techniques include an exploration-preserving policy discretization and a generalized G-optimal experiment design, which could be of independent interest (see the experiment-design sketch after this table). Lastly, we analyze the related problem of regret minimization in low-adaptive RL and provide information-theoretic lower bounds for switching cost and batch complexity. |
| Researcher Affiliation | Academia | Dan Qiao, Department of Computer Science, UCSB (danqiao@ucsb.edu); Yu-Xiang Wang, Department of Computer Science, UCSB (yuxiangw@cs.ucsb.edu) |
| Pseudocode | Yes | Algorithm 1: Layer-by-layer Reward-Free Exploration via Experimental Design (Exploration) ... Algorithm 2: Find Near-Optimal Policy Given Reward Function (Planning) ... Algorithm 3: Estimation of $V^\pi(r)$ given exploration data (Estimate V) ... Algorithm 4: Estimation of $\mathbb{E}_\pi r(s_h, a_h)$ given exploration data (Estimate ER) (see the value-estimation sketch after this table) |
| Open Source Code | No | The paper does not contain any explicit statements or links indicating that source code for the described methodology is publicly available. |
| Open Datasets | No | The paper is theoretical and does not conduct experiments involving datasets, training, or evaluation. |
| Dataset Splits | No | The paper is theoretical and does not conduct experiments involving datasets or validation splits. |
| Hardware Specification | No | The paper is theoretical and does not describe any experiments that would require specific hardware. No hardware specifications are mentioned. |
| Software Dependencies | No | The paper is theoretical and does not describe any experiments that would require specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not describe empirical experiments or their setup, thus no hyperparameters or system-level training settings are provided. |
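
The abstract quoted in the Research Type row names a generalized G-optimal experiment design as one of the paper's key techniques. As background for that row, the sketch below computes a *classical* G-optimal design over a finite feature set with the standard Frank-Wolfe (Fedorov-Wynn) update; the function name, tolerance, and iteration budget are illustrative assumptions, and this is the textbook design, not the paper's generalized variant.

```python
import numpy as np

def g_optimal_design(X, n_iters=1000, tol=1e-6):
    """Classical G-optimal design over the rows of X (shape [n, d]):
    find weights w minimizing max_i x_i^T A(w)^{-1} x_i, where
    A(w) = sum_i w_i x_i x_i^T. By the Kiefer-Wolfowitz theorem the
    optimal value equals d, so we stop once max_i g_i <= d + tol."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)              # start from the uniform design
    for _ in range(n_iters):
        A = X.T @ (w[:, None] * X)       # design matrix A(w)
        A_inv = np.linalg.inv(A + 1e-10 * np.eye(d))
        g = np.einsum("ij,jk,ik->i", X, A_inv, X)  # leverage x^T A^{-1} x
        i = int(np.argmax(g))            # most under-covered direction
        if g[i] <= d + tol:
            break
        # exact line-search step for the D-optimal (= G-optimal) objective
        gamma = (g[i] - d) / (d * (g[i] - 1.0))
        w = (1.0 - gamma) * w
        w[i] += gamma
    return w

# Usage: on random features the maximum leverage converges to d = 5.
X = np.random.randn(200, 5)
w = g_optimal_design(X)
```

Sampling trajectories in proportion to such a design spreads coverage evenly over all feature directions, which is the role experiment design plays in reward-free exploration.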
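
The Pseudocode row lists Algorithms 3 and 4, which estimate $V^\pi(r)$ and $\mathbb{E}_\pi r(s_h, a_h)$ from exploration data. Below is a minimal backward least-squares (LSVI-style) policy-evaluation sketch for a linear MDP; the `data`, `phi`, and `pi` interfaces are hypothetical placeholders, and the plain ridge regression is a generic stand-in rather than the paper's exact estimator.

```python
import numpy as np

def evaluate_policy(data, phi, pi, d, H, lam=1.0):
    """Backward least-squares policy evaluation in a linear MDP.
    data[h]: transitions (s, a, r, s_next) collected at layer h;
    phi(s, a) -> R^d: the known feature map; pi(h, s) -> a: the
    policy being evaluated. Returns per-layer weights w with
    Q_h(s, a) ~ phi(s, a) @ w[h]."""
    w = [np.zeros(d) for _ in range(H + 1)]   # w[H] is identically zero
    for h in reversed(range(H)):
        Lam = lam * np.eye(d)                 # ridge-regularized Gram matrix
        b = np.zeros(d)
        for (s, a, r, s_next) in data[h]:
            x = phi(s, a)
            # regression target: reward plus next-layer value under pi
            y = r + (phi(s_next, pi(h + 1, s_next)) @ w[h + 1]
                     if h + 1 < H else 0.0)
            Lam += np.outer(x, x)
            b += y * x
        w[h] = np.linalg.solve(Lam, b)
    return w  # V^pi(s0) ~ phi(s0, pi(0, s0)) @ w[0]
```

This also hints at where the $d$ dependence in the sample complexity comes from: each layer solves a $d$-dimensional regression whose error is governed by how well the exploration data covers the feature directions.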