Least-Squares Temporal Difference Learning for the Linear Quadratic Regulator

Authors: Stephen Tu, Benjamin Recht

ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct numerical experiments on LSTD for value function estimation, and Least-Squares Policy Iteration (LSPI) for an end-to-end comparison with the model-based methods in Dean et al. (2017). Our implementation is carried out in Python using numpy for linear algebraic computations and PyWren (Jonas et al., 2017) for parallelization. In our first set of experiments, we construct synthetic examples where we vary the condition number of the resulting closed-loop controllability Gramian. We find that on these instances, as the condition number increases, the number of samples required to estimate the value function to a fixed relative error also increases, as predicted by our result in Theorem 4.3. In our second set of experiments, we compare model-free policy iteration (LSPI) to two model-based methods: (a) the naïve nominal-model controller, which is designed assuming the nominal model has zero error, and (b) a controller based on a semidefinite relaxation of the non-convex robust control problem with static state feedback. Our experiments show that model-free policy iteration requires more samples than the model-based methods for the instances we consider. (A minimal sketch of the closed-loop controllability Gramian computation appears after this table.)
Researcher Affiliation | Academia | Stephen Tu and Benjamin Recht, EECS Department, University of California, Berkeley. Correspondence to: Stephen Tu <stephent@berkeley.edu>.
Pseudocode | No | The paper describes algorithms and derivations in prose and mathematical notation but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'Our implementation is carried out in Python using numpy for linear algebraic computations and PyWren (Jonas et al., 2017) for parallelization.' but does not provide any statement or link about the availability of their source code.
Open Datasets | No | The paper uses 'synthetic examples' generated by specific parameters and processes (e.g., 'We collect M independent trajectories of the system (5.1) excited by independent Gaussian noise N(0, I3) of length N = 20.') rather than a publicly available dataset with concrete access information.
Dataset Splits | No | The paper describes using prefixes of generated trajectories for evaluation ('For each trajectory, we take the first Np points for Np ∈ {100, 200, ..., 1000} and compute the LSTD estimator P̂_Np on the first Np data points.') but does not specify a train/validation/test split or cross-validation. (A sketch of this trajectory-collection and prefix-evaluation loop appears after the table.)
Hardware Specification | No | The paper mentions 'Our implementation is carried out in Python using numpy for linear algebraic computations and PyWren (Jonas et al., 2017) for parallelization.' but does not provide any specific details about the hardware (e.g., CPU or GPU models, memory) used for the experiments.
Software Dependencies | Yes | The paper mentions 'We solve the resulting SDPs using cvxpy (Diamond & Boyd, 2016) with MOSEK (2015).' and the reference for MOSEK specifies 'Version 7.1 (Revision 28).' (A generic cvxpy/MOSEK example appears after the table.)
Experiment Setup | Yes | We consider several instances of LQR with n = 5, Q = R = 0.1 I5, and γ = 0.9. For each configuration, we collect 100 trajectories of length N = 1000. For the purposes of comparison, we set K0 such that the closed-loop matrix A + BK0 = diag(0.6, 0.6, 0.6). (A sketch of reconstructing this kind of setup appears after the table.)
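
The first experiment quoted under Research Type varies the condition number of the closed-loop controllability Gramian. As a point of reference, here is a minimal numpy sketch of that quantity for an arbitrary stabilizing gain; the matrices A, B, and K below are placeholders, not the instances constructed in the paper.

```python
import numpy as np

def closed_loop_gramian_cond(A, B, K, noise_cov=None):
    """Condition number of the controllability Gramian W of the closed-loop
    system x_{t+1} = (A + B K) x_t + w_t with w_t ~ N(0, noise_cov).

    W solves the discrete Lyapunov equation W = L W L^T + noise_cov, handled
    here by vectorization: vec(W) = (I - kron(L, L))^{-1} vec(noise_cov).
    """
    n = A.shape[0]
    L = A + B @ K
    if noise_cov is None:
        noise_cov = np.eye(n)
    vec_w = np.linalg.solve(np.eye(n * n) - np.kron(L, L), noise_cov.reshape(-1))
    W = vec_w.reshape(n, n)
    return np.linalg.cond(W)

# Placeholder 3-dimensional instance (not from the paper).
A, B, K = 0.5 * np.eye(3), np.eye(3), 0.1 * np.eye(3)
print(closed_loop_gramian_cond(A, B, K))
```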
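The Open Datasets and Dataset Splits rows quote the data-generation protocol: independent trajectories driven by Gaussian noise, with the estimator recomputed on growing prefixes. The sketch below implements that loop with a textbook LSTD estimator over quadratic features; it is a simplified stand-in under assumed dynamics, not the exact estimator analyzed in the paper, and the system matrices are hypothetical.

```python
import numpy as np

def phi(x):
    # Quadratic features vec(x x^T) plus a constant to absorb the noise offset.
    return np.concatenate([np.outer(x, x).reshape(-1), [1.0]])

def rollout(A, B, K, Q, R, N, rng):
    """One trajectory of x_{t+1} = A x_t + B u_t + w_t, u_t = K x_t, w_t ~ N(0, I).
    Returns the states (N+1, n) and the stage costs c_t = x_t'Q x_t + u_t'R u_t."""
    n = A.shape[0]
    xs, cs = np.zeros((N + 1, n)), np.zeros(N)
    for t in range(N):
        u = K @ xs[t]
        cs[t] = xs[t] @ Q @ xs[t] + u @ R @ u
        xs[t + 1] = A @ xs[t] + B @ u + rng.standard_normal(n)
    return xs, cs

def lstd(trajectories, gamma, n):
    """Textbook LSTD: solve sum_t phi_t (phi_t - gamma phi_{t+1})' w = sum_t c_t phi_t,
    then read the quadratic block of w back out as a symmetric matrix P_hat."""
    d = n * n + 1
    M, b = np.zeros((d, d)), np.zeros(d)
    for xs, cs in trajectories:
        for t in range(len(cs)):
            f, f_next = phi(xs[t]), phi(xs[t + 1])
            M += np.outer(f, f - gamma * f_next)
            b += cs[t] * f
    # vec(x x^T) duplicates off-diagonal terms, so use a (minimum-norm)
    # least-squares solve instead of a plain inverse.
    w = np.linalg.lstsq(M, b, rcond=None)[0]
    P = w[:n * n].reshape(n, n)
    return 0.5 * (P + P.T)

# Hypothetical instance: 100 rollouts of length 20, estimated on growing prefixes.
rng = np.random.default_rng(0)
n = 3
A, B, K = 0.9 * np.eye(n), np.eye(n), -0.3 * np.eye(n)
Q, R, gamma = np.eye(n), np.eye(n), 0.9
data = [rollout(A, B, K, Q, R, N=20, rng=rng) for _ in range(100)]
for prefix in (5, 10, 20):
    P_hat = lstd([(xs[:prefix + 1], cs[:prefix]) for xs, cs in data], gamma, n)
    print(prefix, np.round(P_hat, 2))
```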
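The Software Dependencies row records that the SDPs were solved with cvxpy and MOSEK. The snippet below is a generic example of that tool chain on a Lyapunov-type LMI for a discounted closed-loop system; it is not the paper's robust-synthesis SDP, and the matrices are placeholders. MOSEK needs a license, so the call falls back to cvxpy's default conic solver unless solver=cp.MOSEK is passed.

```python
import cvxpy as cp
import numpy as np

# Placeholder stable closed-loop matrix and stage cost (not from the paper).
n = 3
L = 0.6 * np.eye(n)
Qc = 0.1 * np.eye(n)
gamma = 0.9

# Smallest P (in trace) satisfying P >= Qc + gamma * L' P L and P >= 0;
# the minimizer is the discounted value matrix of the closed-loop system.
P = cp.Variable((n, n), symmetric=True)
constraints = [P >> 0, gamma * L.T @ P @ L - P + Qc << 0]
problem = cp.Problem(cp.Minimize(cp.trace(P)), constraints)

problem.solve()  # or problem.solve(solver=cp.MOSEK) with a MOSEK license
print(problem.status)
print(P.value)
```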
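Finally, the Experiment Setup row fixes the closed-loop matrix through the choice of K0 and reports estimation error relative to the true value matrix. A numpy sketch of how such a setup can be reconstructed is given below; B is assumed square and invertible so that K0 has a closed form, and the specific A, B, and target closed-loop matrix are assumptions rather than the paper's instances.

```python
import numpy as np

def value_matrix(L, Qc, gamma, iters=10_000, tol=1e-12):
    """Fixed-point iteration for the discounted Lyapunov equation
    P = Qc + gamma * L^T P L (converges when sqrt(gamma) * L is stable)."""
    P = np.zeros_like(Qc)
    for _ in range(iters):
        P_next = Qc + gamma * L.T @ P @ L
        if np.max(np.abs(P_next - P)) < tol:
            return P_next
        P = P_next
    return P

# Placeholder dynamics; B is assumed invertible.
n = 5
A = 0.5 * np.eye(n)
B = np.eye(n)
Q = R = 0.1 * np.eye(n)
gamma = 0.9

# Pick K0 so that the closed-loop matrix A + B K0 hits a chosen stable target.
L_target = 0.6 * np.eye(n)
K0 = np.linalg.solve(B, L_target - A)

# Ground-truth cost-to-go matrix of the policy u = K0 x, used when reporting
# relative error of an estimate:  ||P_hat - P_true||_F / ||P_true||_F.
P_true = value_matrix(A + B @ K0, Q + K0.T @ R @ K0, gamma)
print(np.round(P_true, 3))
```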