No-Regret Reinforcement Learning in Smooth MDPs

Authors: Davide Maran, Alberto Maria Metelli, Matteo Papini, Marcello Restelli

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this subsection, we empirically show, on an illustrative problem, that the use of orthogonal features has beneficial effects on learning performance. We employ two modified versions of the LQR, in which the state, after the linear dynamic transition, is pushed towards the origin in a way that prevents it from escaping from a given compact set. Precisely, using the same formalism of the LQR, we have $s_{h+1} = g(A s_h + B a_h + \xi_h)$ and $r_h = -s_h^\top Q s_h - a_h^\top R a_h$, where $g(x) := x / (1 + \|x\|_2)$ and $\xi_h$ is Gaussian noise. As the support of the Gaussian distribution is the full $\mathbb{R}^d$, after applying $g(\cdot)$, the possible set of new states is the ball of radius one. We performed two experiments with different parameter values and with horizon $H = 20$, whose details can be found in Appendix C. In Figure 1, we can see plots showing the episodic return of the algorithms as a function of the number of learning episodes. (A minimal simulation sketch of this environment is given after the table.)
Researcher Affiliation | Academia | Politecnico di Milano, Milan, Italy.
Pseudocode | No | The paper refers to algorithms by name (LEGENDRE-ELEANOR, LEGENDRE-LSVI) and notes they build on others (ELEANOR, LSVI-UCB), but does not provide their pseudocode within the main text. (A generic sketch of the Legendre feature map these names refer to is given after the table.)
Open Source Code | No | No explicit statement or link for open-source code for the described methodology was found.
Open Datasets | No | The paper uses only a simulated environment (a modified LQR) for its experiments; no publicly available dataset is used or referenced.
Dataset Splits | No | No explicit training, validation, or test dataset splits were mentioned for the experiments.
Hardware Specification | Yes | CPU: 88 Intel(R) Xeon(R) E7-8880 v4 @ 2.20GHz CPUs; RAM: 94.0 GB.
Software Dependencies | No | The algorithms were implemented in Python 3.7. Each experiment was executed using five random seeds (corresponding to the first five natural numbers), and the computations were distributed across five parallel processes using the joblib library. The software versions are not fully specified for all key components (e.g., the joblib version is missing).
Experiment Setup | Yes | We performed two experiments with different parameter values and with horizon $H = 20$, whose details can be found in Appendix C. In Section 4.3, we performed a numerical simulation on a modified version of the Linear Quadratic Regulator (LQR). Both environments took the form $s_{h+1} = g(A s_h + B a_h + \xi_h)$, $r_h = -s_h^\top Q s_h - a_h^\top R a_h$, where $g(x) := x / (1 + \|x\|_2)$. Moreover, in both cases the dimension of the state space is 2 and that of the action space is 1. In both cases $Q = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$; what changes is the matrix $A$, which determines most of the dynamics of the system. Left experiment: $A = \begin{pmatrix} 0.7 & 0.7 \\ 0.7 & 0.7 \end{pmatrix}$; Right experiment: $A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$. (The two configurations are written out in the code sketch after the table.)
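For concreteness, the modified LQR described in the Research Type and Experiment Setup rows can be simulated in a few lines. The matrices $B$ and $R$, the noise scale, the initial state, and the policy are not reported in this section, so the values below are placeholder assumptions; this is a minimal sketch of the transition and reward structure, not the authors' implementation.

```python
import numpy as np

def g(x):
    # Squashing map g(x) = x / (1 + ||x||_2): keeps the next state inside the unit ball.
    return x / (1.0 + np.linalg.norm(x))

def run_episode(A, B, Q, R, H=20, noise_std=0.1, rng=None):
    """Simulate one episode of the modified LQR and return its episodic return."""
    rng = np.random.default_rng() if rng is None else rng
    s = np.zeros(A.shape[0])                              # initial state (assumed: origin)
    total_reward = 0.0
    for _ in range(H):
        a = rng.uniform(-1.0, 1.0, size=B.shape[1])       # placeholder random policy
        total_reward += -s @ Q @ s - a @ R @ a            # quadratic cost as negative reward
        xi = noise_std * rng.standard_normal(A.shape[0])  # Gaussian noise (scale assumed)
        s = g(A @ s + B @ a + xi)                         # squashed linear transition
    return total_reward
```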
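Using the run_episode helper from the sketch above, the two configurations from the Experiment Setup row (2-dimensional state, scalar action, $Q = I_2$, and the two $A$ matrices) can be written down directly; $B$ and $R$ are again placeholder assumptions, since their values are not given in this section.

```python
import numpy as np

Q = np.eye(2)                   # Q is the 2x2 identity in both experiments
B = np.array([[0.0], [1.0]])    # placeholder: B is not reported in this section
R = np.array([[0.1]])           # placeholder: R is not reported in this section

# Only the dynamics matrix A differs between the two experiments.
A_left = np.array([[0.7, 0.7],
                   [0.7, 0.7]])
A_right = np.array([[0.0, 1.0],
                    [1.0, 0.0]])

for name, A in [("left", A_left), ("right", A_right)]:
    ret = run_episode(A, B, Q, R, H=20, rng=np.random.default_rng(0))
    print(f"{name} experiment, random-policy episodic return: {ret:.3f}")
```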
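The Pseudocode row mentions LEGENDRE-ELEANOR and LEGENDRE-LSVI, whose distinguishing ingredient is an orthogonal (Legendre-polynomial) feature map. Since this section does not spell out that construction, the following is only a generic sketch of tensor-product Legendre features on a box $[-1, 1]^d$ (here applied to a concatenated state-action vector), not necessarily the exact map used by the authors.

```python
import numpy as np
from itertools import product
from scipy.special import eval_legendre

def legendre_features(z, max_degree):
    """Tensor-product Legendre features for a point z in [-1, 1]^d.

    Returns one feature per multi-index (n_1, ..., n_d) with each n_i <= max_degree,
    namely prod_i P_{n_i}(z_i), where P_n is the degree-n Legendre polynomial.
    """
    z = np.asarray(z, dtype=float)
    d = z.shape[0]
    feats = []
    for degrees in product(range(max_degree + 1), repeat=d):
        feats.append(np.prod([eval_legendre(n, z[i]) for i, n in enumerate(degrees)]))
    return np.array(feats)

# Example: features of a (state, action) pair with a 2-D state in the unit ball and a
# scalar action in [-1, 1], concatenated into one vector z in [-1, 1]^3.
z = np.array([0.2, -0.5, 0.3])
phi = legendre_features(z, max_degree=3)
print(phi.shape)   # (64,) = (4**3,)
```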