Learning Linear-Quadratic Regulators Efficiently with only $\sqrtT$ Regret

Authors: Alon Cohen, Tomer Koren, Yishay Mansour

ICML 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Theoretical We present the first computationally-efficient algorithm with e O(T) regret for learning in Linear Quadratic Control systems with unknown dynamics. By that, we resolve an open question of Abbasi-Yadkori and Szepesv ari (2011) and Dean, Mania, Matni, Recht, and Tu (2018). In this paper, we give the first computationally-efficient algorithm that attains e O(T) regret for learning LQ systems, thus resolving the open problem of Abbasi-Yadkori and Szepesv ari (2011) and Dean et al. (2018). The key to the efficiency of our algorithm is in reformulating the LQ control problem as a convex semi-definite program. Our algorithm solves a sequence of semi-definite relaxations of the infinite horizon LQ problem, the solutions of which are used to compute optimistic policies for the underlying unknown LQ system. (Section 4) We now formally state our main result: a high-probability e O(T) regret bound for the efficient algorithm given in Algorithm 1. Theorem 4. Suppose that Algorithm 1 is initialized so that the initial estimation error (A0 B0) (A B ) 2 F ε satisfies 4λ = α5 0σ10 Assume T poly(n, ν, ϑ, α 1 0 , σ 1, x1 ). Then for any δ (0, 1), with probability at least 1 δ the regret of Algorithm 1 satisfies RT = O ν5n3ϑ Furthermore, the run-time per round of the procedure is polynomial in these factors.
Researcher Affiliation Collaboration 1Google Research, Tel-Aviv 2Technion Israel Inst. of Technology 3Google Brain, Mountain View 4Tel-Aviv University. Correspondence to: Alon Cohen <alon.cohen@technion.ac.il>.
Pseudocode Yes In this section we describe our efficient online algorithm for learning in LQRs; see pseudo-code in Algorithm 1. ... Algorithm 1 OSLO: Optimistic Sdp for Lq c Ontrol ... Algorithm 2 Warm-up
Open Source Code No The paper does not provide any statement about making its source code available, nor does it include a link to a code repository.
Open Datasets No The paper is a theoretical work focusing on algorithm design and analysis of regret bounds. It does not conduct empirical experiments using datasets, thus no dataset is mentioned as publicly available for training.
Dataset Splits No The paper does not describe empirical experiments with datasets, and therefore, no specific dataset split information for validation is provided.
Hardware Specification No The paper discusses theoretical runtime complexity ('run-time per round of the procedure is polynomial in these factors') but does not specify any particular hardware (GPU, CPU models, etc.) used for computations or experiments.
Software Dependencies No The paper does not mention any specific software dependencies or their version numbers that would be needed to replicate the theoretical analysis or algorithm design.
Experiment Setup No The paper is theoretical and does not describe empirical experiments. Therefore, it does not provide details on experimental setup such as hyperparameters or system-level training settings.