Robust Reinforcement Learning: A Case Study in Linear Quadratic Regulation

Authors: Bo Pang, Zhong-Ping Jiang

AAAI 2021, pp. 9303-9311

Each entry below lists a reproducibility variable, its assessed result, and the supporting LLM response (with evidence quoted from the paper where available).

Research Type: Experimental
Evidence: "Experiments on a numerical example validate our results. ... We apply O-LSPI to the LQR problem studied in (Krauth, Tu, and Recht 2019) ... To investigate the performance of the algorithm with different values of M and T, we conducted two sets of experiments: (a) Fix N = 5 and T = 45, and implement Algorithm 1 with increasing values of M from 200 to 10^6; (b) Fix N = 5 and M = 10^6, and implement Algorithm 1 with increasing values of T from 2 to 45. ... In Figure 1, as the number of rollout M increases, the fraction of stability becomes one, and both the sample average and sample variance of relative error converge to zero."

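The two quantities tracked in Figure 1, the fraction of stable closed loops and the relative error of the learned gain, are standard LQR diagnostics. A minimal sketch of how they are typically computed follows; the function names and the choice of norm are illustrative assumptions, since the quoted excerpt does not define them:

```python
import numpy as np

def is_stabilizing(A, B, K):
    # A learned gain K (with u = -K x) counts as "stable" when the
    # closed-loop spectral radius of A - B K is strictly below one.
    return np.max(np.abs(np.linalg.eigvals(A - B @ K))) < 1.0

def relative_error(K_hat, K_star):
    # Relative error of the learned gain against the optimal LQR gain.
    # The norm is an assumption (NumPy defaults to Frobenius for matrices).
    return np.linalg.norm(K_hat - K_star) / np.linalg.norm(K_star)
```
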
Researcher Affiliation: Academia
Evidence: "Bo Pang, Zhong-Ping Jiang, Department of Electrical and Computer Engineering, New York University, Six Metrotech Center, Brooklyn, NY 11201. {bo.pang, zjiang}@nyu.edu"

Pseudocode: Yes
Evidence: "Procedure 1 (Exact Policy Iteration). ... Procedure 2 (Inexact Policy Iteration). ... Algorithm 1: O-LSPI"

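Procedure 1 matches classical exact policy iteration for discrete-time LQR (Hewer's algorithm), so a minimal sketch is possible. The tolerance, iteration cap, and the requirement that `K0` be stabilizing are illustrative assumptions here, not details quoted from the paper:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def exact_policy_iteration(A, B, S, R, K0, tol=1e-10, max_iters=100):
    """Hewer-style exact policy iteration for discrete-time LQR (u = -K x).

    Alternates:
      policy evaluation:  P solves  P = (A-BK)' P (A-BK) + S + K'RK
      policy improvement: K <- (R + B' P B)^{-1} B' P A
    K0 must be stabilizing for the Lyapunov solve to be meaningful.
    """
    K = K0
    for _ in range(max_iters):
        A_cl = A - B @ K
        # solve_discrete_lyapunov(a, q) returns X with X = a X a' + q,
        # so pass a = A_cl' to obtain P = A_cl' P A_cl + S + K'RK.
        P = solve_discrete_lyapunov(A_cl.T, S + K.T @ R @ K)
        K_next = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        if np.linalg.norm(K_next - K) < tol:
            break
        K = K_next
    return K_next, P
```

Procedure 2 and Algorithm 1 (O-LSPI) presumably replace the exact Lyapunov solve with a least-squares estimate from trajectory data, consistent with the "least-squares policy iteration" naming; that data-driven step is not sketched here.
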
Open Source Code: No
Evidence: The paper provides no link to, or explicit statement about, releasing the source code for its method. It acknowledges that "Bo Pang thanks Dr. Stephen Tu for sharing the code of the least-squares policy iteration algorithms in (Krauth, Tu, and Recht 2019)", but that refers to external code, not the authors' own.

Open Datasets: No
Evidence: The paper sets up a specific LQR problem with defined system matrices (A, B, C, S, R) and collects data from this simulated system. It neither uses a pre-existing, publicly available dataset of empirical observations nor provides access information for the synthetically generated data.

Dataset Splits: No
Evidence: The paper specifies no training, validation, or test splits. It describes how data are collected during the experiment (e.g., using a behavior policy), not how a pre-existing dataset is partitioned.

Hardware Specification: Yes
Evidence: "All the experiments are conducted using MATLAB 2017b, on the New York University High Performance Computing Cluster Prince with 4 CPUs and 16GB Memory."

Software Dependencies: Yes
Evidence: The same sentence identifies the software environment: "All the experiments are conducted using MATLAB 2017b, on the New York University High Performance Computing Cluster Prince with 4 CPUs and 16GB Memory."

Experiment Setup: Yes
Evidence: "This yields N = 5, T = 45 and M = 10^6. To investigate the performance of the algorithm with different values of M and T, we conducted two sets of experiments: (a) Fix N = 5 and T = 45, and implement Algorithm 1 with increasing values of M from 200 to 10^6; (b) Fix N = 5 and M = 10^6, and implement Algorithm 1 with increasing values of T from 2 to 45. ... The exploration variance is set to σ_u^2 = 1. All the experiments are conducted using MATLAB 2017b, on the New York University High Performance Computing Cluster Prince with 4 CPUs and 16GB Memory."

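As a rough illustration of the quoted setup, the sketch below collects M rollouts of horizon T under a behavior policy with exploration variance σ_u^2 = 1, as in experiment (a). The dynamics, initial-state distribution, behavior gain, and example matrices are placeholders; the paper's actual system matrices and Algorithm 1 (O-LSPI) are not reproduced here:

```python
import numpy as np

rng = np.random.default_rng(0)

def collect_rollouts(A, B, K_b, M, T, sigma_u=1.0):
    # M rollouts of horizon T under the behavior policy
    # u_t = -K_b x_t + e_t, with exploration noise e_t ~ N(0, sigma_u^2 I).
    # Returns (x_t, u_t, x_{t+1}) transitions for least-squares evaluation.
    n, m = B.shape
    transitions = []
    for _ in range(M):
        x = rng.standard_normal(n)   # assumed initial-state distribution
        for _ in range(T):
            u = -K_b @ x + sigma_u * rng.standard_normal(m)
            x_next = A @ x + B @ u   # placeholder (noise-free) dynamics
            transitions.append((x, u, x_next))
            x = x_next
    return transitions

# Example usage with placeholder 2-state / 1-input matrices:
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
K_b = np.array([[0.5, 0.5]])
data = collect_rollouts(A, B, K_b, M=200, T=45)
```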