No-Regret Reinforcement Learning in Smooth MDPs
Authors: Davide Maran, Alberto Maria Metelli, Matteo Papini, Marcello Restelli
ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this subsection, we empirically show, on an illustrative problem, that the use of orthogonal features has beneficial effects on learning performance. We employ two modified versions of the LQR, in which the state, after the linear dynamic transition, is pushed towards the origin in a way that prevents it from escaping from a given compact set. Precisely, using the same formalism of the LQR, we have $s_{h+1} = g(A s_h + B a_h + \xi_h)$ and $r_h = -s_h^\top Q s_h - a_h^\top R a_h$, where $g(x) := \frac{x}{1 + \lVert x \rVert_2}$ and $\xi_h$ is Gaussian noise. As the support of the Gaussian distribution is the full $\mathbb{R}^d$, after applying $g(\cdot)$ the set of possible new states is the ball of radius one. We performed two experiments with different parameter values and with horizon $H = 20$, whose details can be found in Appendix C. In Figure 1, we can see plots showing the episodic return of the algorithms as a function of the number of learning episodes. (A hedged code sketch of these dynamics follows the table.) |
| Researcher Affiliation | Academia | Politecnico di Milano, Milan, Italy. |
| Pseudocode | No | The paper refers to algorithms by name (LEGENDRE-ELEANOR, LEGENDRE-LSVI) and notes they build on others (ELEANOR, LSVI-UCB), but does not provide their pseudocode within the main text. |
| Open Source Code | No | No explicit statement or link for open-source code for the described methodology was found. |
| Open Datasets | No | The paper uses simulated environments (two modified versions of the LQR) rather than publicly available datasets. |
| Dataset Splits | No | No explicit training, validation, or test dataset splits were mentioned for the experiments. |
| Hardware Specification | Yes | CPU: 88 Intel(R) Xeon(R) E7-8880 v4 @ 2.20GHz CPUs, RAM: 94.0 GB. |
| Software Dependencies | No | The algorithms were implemented in Python 3.7. Each experiment was executed using five random seeds (corresponding to the first five natural numbers), and the computations were distributed across five parallel processes using the joblib library. The software versions are not fully specified for all key components (e.g., the joblib version is missing). |
| Experiment Setup | Yes | We performed two experiments with different parameter values and with horizon $H = 20$, whose details can be found in Appendix C. In Section 4.3, we performed a numerical simulation on a modified version of the Linear Quadratic Regulator (LQR). Both environments took the form $s_{h+1} = g(A s_h + B a_h + \xi_h)$, $r_h = -s_h^\top Q s_h - a_h^\top R a_h$, where $g(x) := \frac{x}{1 + \lVert x \rVert_2}$. Moreover, in both cases the dimension of the state space is 2 and that of the action space is 1. Also, in both cases $Q = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$; what changes is the matrix $A$, which determines most of the dynamics of the system. For this matrix, we have: left experiment, $A = \begin{pmatrix} 0.7 & 0.7 \\ 0.7 & 0.7 \end{pmatrix}$; right experiment, $A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$. (See the environment and experiment-configuration sketches after the table.) |
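
For concreteness, here is a minimal Python sketch of the modified LQR environment described in the excerpts above. Only the squashing map $g$, the quadratic reward, the matrix $Q$, and the horizon $H = 20$ come from the paper; the matrix $B$, the control-cost matrix $R$, the noise scale, and the initial-state distribution are not given in the excerpts and are placeholder assumptions.

```python
import numpy as np


def g(x):
    # Squashing map g(x) = x / (1 + ||x||_2): keeps every new state inside the unit ball.
    return x / (1.0 + np.linalg.norm(x, 2))


class SmoothedLQR:
    """Modified LQR: s_{h+1} = g(A s_h + B a_h + xi_h), r_h = -s_h^T Q s_h - a_h^T R a_h.

    A, Q, g and H follow the paper's description; B, R, the noise scale and the
    initial state are illustrative assumptions.
    """

    def __init__(self, A, H=20, noise_std=0.1, seed=0):
        self.A = np.asarray(A, dtype=float)
        self.B = np.array([[1.0], [0.0]])  # assumed: 2-d state, 1-d action
        self.Q = np.eye(2)                 # Q is the 2x2 identity, as in the paper
        self.R = np.eye(1)                 # assumed control-cost matrix
        self.H = H
        self.noise_std = noise_std
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.h = 0
        self.s = np.zeros(2)               # assumed initial state at the origin
        return self.s

    def step(self, a):
        a = np.atleast_1d(np.asarray(a, dtype=float))
        # Reward on the current state and action: r_h = -s_h^T Q s_h - a_h^T R a_h.
        r = -(self.s @ self.Q @ self.s) - (a @ self.R @ a)
        xi = self.rng.normal(0.0, self.noise_std, size=2)
        self.s = g(self.A @ self.s + self.B @ a + xi)
        self.h += 1
        return self.s, r, self.h >= self.H
```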
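
The two reported dynamics matrices and the five-seed parallel execution mentioned under Software Dependencies could then be wired up roughly as follows. `run_experiment` is a hypothetical driver with a uniformly random placeholder policy; it stands in for the paper's actual algorithms (e.g., LEGENDRE-LSVI), whose pseudocode is not given in the main text.

```python
import numpy as np
from joblib import Parallel, delayed

# Dynamics matrices for the two experiments, with values as printed in the excerpt.
A_LEFT = np.array([[0.7, 0.7],
                   [0.7, 0.7]])
A_RIGHT = np.array([[0.0, 1.0],
                    [1.0, 0.0]])


def run_experiment(A, seed, episodes=100):
    # Hypothetical driver: runs a placeholder random policy on SmoothedLQR(A)
    # and records the per-episode returns.
    env = SmoothedLQR(A, seed=seed)
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(episodes):
        s, done, total = env.reset(), False, 0.0
        while not done:
            a = rng.uniform(-1.0, 1.0, size=1)  # placeholder policy
            s, r, done = env.step(a)
            total += r
        returns.append(total)
    return returns


# Five seeds (the first five natural numbers), one parallel worker per seed.
results = Parallel(n_jobs=5)(
    delayed(run_experiment)(A_LEFT, seed) for seed in range(5)
)
```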