Estimating the Maximum Expected Value in Continuous Reinforcement Learning Problems
Authors: Carlo D'Eramo, Alessandro Nuara, Matteo Pirotta, Marcello Restelli
AAAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate the effectiveness of the proposed approach we perform empirical comparisons with related approaches. In this section we evaluate the performance of ME, DE and WE on three sequential decision-making problems: one Multi-Armed Bandit (MAB) problem and an MDP with both finite and continuous actions. |
| Researcher Affiliation | Academia | Carlo D'Eramo, Alessandro Nuara, Matteo Pirotta, Marcello Restelli Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano Piazza Leonardo da Vinci, 32, 20133, Milano, Italy carlo.deramo@polimi.it, alessandro.nuara@mail.polimi.it, matteo.pirotta@polimi.it, marcello.restelli@polimi.it |
| Pseudocode | Yes | Algorithm 1 Double FQI, Algorithm 2 Weighted FQI (finite actions), Algorithm 3 Weighted FQI (continuous actions) |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | No | The paper describes generating samples for the Pricing Problem and collecting training sets using a random policy for the Swing-up Pendulum, but does not provide access information or citations for a publicly available dataset. |
| Dataset Splits | No | The paper mentions collecting training sets and evaluating performance on different initial conditions, but does not provide specific train/validation/test dataset splits with percentages, counts, or references to predefined splits. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions using Gaussian Process regression but does not specify any software libraries or dependencies with version numbers. |
| Experiment Setup | Yes | Results are averaged on 50 runs in order to show confidence intervals at 95%. The GP uses a squared exponential kernel with independent length scale for each input dimension (ARD SE). The hyperparameters are fitted on the samples and the input values are normalized between [-1, 1]. ... The FQI horizon is 10 iterations. |
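
The Experiment Setup row names three ingredients that can be illustrated compactly: input normalization to [-1, 1], an ARD squared exponential kernel, and 95% confidence intervals over repeated runs. The paper does not name a software library, so the sketch below is not the authors' code. The function names, the `signal_variance` parameter, and the normal-approximation interval (z = 1.96) are assumptions made for illustration.

```python
import math
import statistics


def normalize(values, lo, hi):
    """Linearly map values from the range [lo, hi] onto [-1, 1]."""
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in values]


def ard_se_kernel(x, y, length_scales, signal_variance=1.0):
    """Squared exponential kernel with one length scale per input dimension.

    k(x, y) = signal_variance * exp(-0.5 * sum_d ((x_d - y_d) / l_d)^2)
    """
    sq_dist = sum(((a - b) / l) ** 2 for a, b, l in zip(x, y, length_scales))
    return signal_variance * math.exp(-0.5 * sq_dist)


def mean_with_ci95(run_results):
    """Return the mean and the half-width of a 95% interval over runs.

    Assumption: a normal approximation (z = 1.96) with the sample
    standard deviation; the source does not state how its intervals
    were computed.
    """
    n = len(run_results)
    mean = statistics.fmean(run_results)
    half_width = 1.96 * statistics.stdev(run_results) / math.sqrt(n)
    return mean, half_width
```

In the reported setup, `mean_with_ci95` would receive one score per run (50 values), and the entries of `length_scales` would be fitted to the samples rather than fixed by hand.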