Robust Reinforcement Learning: A Case Study in Linear Quadratic Regulation
Authors: Bo Pang, Zhong-Ping Jiang
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on a numerical example validate our results. ... We apply O-LSPI to the LQR problem studied in (Krauth, Tu, and Recht 2019) ... To investigate the performance of the algorithm with different values of M and T, we conducted two sets of experiments: (a) Fix N = 5 and T = 45, and implement Algorithm 1 with increasing values of M from 200 to 10^6; (b) Fix N = 5 and M = 10^6, and implement Algorithm 1 with increasing values of T from 2 to 45. ... In Figure 1, as the number of rollouts M increases, the fraction of stable policies approaches one, and both the sample average and sample variance of the relative error converge to zero. |
| Researcher Affiliation | Academia | Bo Pang, Zhong-Ping Jiang Department of Electrical and Computer Engineering, New York University, Six Metrotech Center, Brooklyn, NY 11201 {bo.pang, zjiang}@nyu.edu |
| Pseudocode | Yes | Procedure 1 (Exact Policy Iteration). ... Procedure 2 (Inexact Policy Iteration). ... Algorithm 1: O-LSPI *(a minimal sketch of exact policy iteration appears below the table)* |
| Open Source Code | No | The paper does not provide a link or explicit statement about releasing the source code for the methodology described in the paper. It mentions that "Bo Pang thanks Dr. Stephen Tu for sharing the code of the least-squares policy iteration algorithms in (Krauth, Tu, and Recht 2019)", which refers to external code, not their own. |
| Open Datasets | No | The paper describes setting up a specific LQR problem with defined system matrices (A, B, C, S, R) and then collecting data from this simulated system. It does not use a pre-existing, publicly available dataset in the typical sense of a collection of empirical observations, nor does it provide concrete access information for the synthetically generated data. |
| Dataset Splits | No | The paper does not provide specific information about training, test, or validation dataset splits. It describes how data is collected during the experiment (e.g., using a behavior policy), but not how a pre-existing dataset is partitioned into these splits. |
| Hardware Specification | Yes | All the experiments are conducted using MATLAB 2017b, on the New York University High Performance Computing Cluster Prince with 4 CPUs and 16GB Memory. |
| Software Dependencies | Yes | All the experiments are conducted using MATLAB 2017b, on the New York University High Performance Computing Cluster Prince with 4 CPUs and 16GB Memory. |
| Experiment Setup | Yes | This yields N = 5, T = 45 and M = 10^6. To investigate the performance of the algorithm with different values of M and T, we conducted two sets of experiments: (a) Fix N = 5 and T = 45, and implement Algorithm 1 with increasing values of M from 200 to 10^6; (b) Fix N = 5 and M = 10^6, and implement Algorithm 1 with increasing values of T from 2 to 45. ... The exploration variance is set to σ_u^2 = 1. All the experiments are conducted using MATLAB 2017b, on the New York University High Performance Computing Cluster Prince with 4 CPUs and 16GB Memory. *(A sketch of the experiment-(a) sweep appears below the table.)* |
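
To make the pseudocode row concrete, here is a minimal Python sketch of exact policy iteration (the paper's Procedure 1) for discrete-time LQR, assuming the standard setting with state weight S and control weight R. The system matrices below are placeholders, not the ones from the paper's numerical example, and all function names are ours.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def policy_evaluation(A, B, S, R, K):
    """Value matrix P of the stabilizing policy u = -K x.

    P solves the Lyapunov equation P = S + K'RK + (A - BK)' P (A - BK).
    solve_discrete_lyapunov(a, q) returns X with X = a X a^H + q,
    so we pass a = (A - BK)^T.
    """
    Ak = A - B @ K
    return solve_discrete_lyapunov(Ak.T, S + K.T @ R @ K)

def policy_improvement(A, B, R, P):
    """Greedy gain K = (R + B'PB)^{-1} B'PA for the current value matrix P."""
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def exact_policy_iteration(A, B, S, R, K0, tol=1e-10, max_iters=50):
    K = K0
    for _ in range(max_iters):
        P = policy_evaluation(A, B, S, R, K)
        K_next = policy_improvement(A, B, R, P)
        if np.linalg.norm(K_next - K) < tol:
            return K_next, P
        K = K_next
    return K, P

# Placeholder system (not the paper's example): A is Schur stable,
# so the zero gain is a valid stabilizing initial policy.
A = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.8, 0.1],
              [0.0, 0.0, 0.7]])
B = np.array([[0.0], [0.0], [1.0]])
S, R = np.eye(3), np.eye(1)
K_star, P_star = exact_policy_iteration(A, B, S, R, K0=np.zeros((1, 3)))
```

Procedure 2 and Algorithm 1 (O-LSPI) replace the exact policy-evaluation step with a least-squares estimate from rollout data; that inexact step is not reproduced here.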
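The reported experiments sweep the number of rollouts M and record two statistics per setting: the fraction of trials whose learned gain stabilizes the plant, and the sample mean and variance of the relative gain error. The harness below reproduces that bookkeeping in the same hypothetical Python setting as above (it reuses A, B, and K_star); `learn_gain` is a stand-in hook for Algorithm 1 (O-LSPI), whose internals are not reimplemented.

```python
def evaluate_gain(K_hat, K_star, A, B):
    """Stability of the closed loop A - B K_hat, and relative gain error."""
    stable = np.max(np.abs(np.linalg.eigvals(A - B @ K_hat))) < 1.0
    rel_err = np.linalg.norm(K_hat - K_star) / np.linalg.norm(K_star)
    return stable, rel_err

def sweep_rollouts(learn_gain, Ms, n_trials, K_star, A, B):
    """Mirror experiment (a): for each rollout budget M, report the fraction
    of stable learned policies and the mean/variance of the relative error."""
    results = {}
    for M in Ms:
        stats = [evaluate_gain(learn_gain(M), K_star, A, B)
                 for _ in range(n_trials)]
        stables = [s for s, _ in stats]
        errs = np.array([e for _, e in stats])
        results[M] = (np.mean(stables), errs.mean(), errs.var())
    return results

# Dummy learner, only to exercise the harness: perturbs the exact gain with
# noise that shrinks as M grows. It is NOT the paper's O-LSPI.
rng = np.random.default_rng(0)
dummy_learner = lambda M: K_star + rng.standard_normal(K_star.shape) / np.sqrt(M)
print(sweep_rollouts(dummy_learner, Ms=[200, 10_000, 1_000_000],
                     n_trials=100, K_star=K_star, A=A, B=B))
```

With any real learner plugged in, the three reported quantities per M correspond directly to the curves described for Figure 1: stability fraction approaching one, and mean and variance of the relative error shrinking toward zero.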