Estimating the Maximum Expected Value in Continuous Reinforcement Learning Problems

Authors: Carlo D'Eramo, Alessandro Nuara, Matteo Pirotta, Marcello Restelli

AAAI 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To evaluate the effectiveness of the proposed approach we perform empirical comparisons with related approaches. In this section we evaluate the performance of ME, DE and WE on three sequential decision-making problems: one Multi-Armed Bandit (MAB) problem and an MDP with both finite and continuous actions."
Researcher Affiliation | Academia | "Carlo D'Eramo, Alessandro Nuara, Matteo Pirotta, Marcello Restelli. Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Piazza Leonardo da Vinci, 32, 20133, Milano, Italy. carlo.deramo@polimi.it, alessandro.nuara@mail.polimi.it, matteo.pirotta@polimi.it, marcello.restelli@polimi.it"
Pseudocode | Yes | Algorithm 1 (Double FQI), Algorithm 2 (Weighted FQI, finite actions), Algorithm 3 (Weighted FQI, continuous actions)
Open Source Code | No | The paper does not provide any explicit statement about releasing source code or a link to a code repository for the described methodology.
Open Datasets | No | The paper describes generating samples for the Pricing Problem and collecting training sets with a random policy for the Swing-up Pendulum, but does not provide access information or citations for a publicly available dataset.
Dataset Splits | No | The paper mentions collecting training sets and evaluating performance on different initial conditions, but does not provide specific train/validation/test splits with percentages, counts, or references to predefined splits.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments.
Software Dependencies | No | The paper mentions using Gaussian Process regression but does not specify any software libraries or dependencies with version numbers.
Experiment Setup | Yes | "Results are averaged on 50 runs in order to show confidence intervals at 95%. The GP uses a squared exponential kernel with independent length scale for each input dimension (ARD SE). The hyperparameters are fitted on the samples and the input values are normalized between [-1, 1]. ... The FQI horizon is 10 iterations."
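As context for the ME/DE/WE comparison cited in the Research Type row: the Maximum Estimator (ME) takes the maximum of the sample means, while the Weighted Estimator (WE) from the authors' earlier finite-action work weights each sample mean by the probability, under a Gaussian approximation, that it is the largest. A minimal sketch for the bandit case (the function name and the grid-integration details are our assumptions, not taken from the paper):

```python
import numpy as np
from scipy.stats import norm

def weighted_estimator(means, std_errs, grid_size=2000):
    """Weighted Estimator (WE) of the maximum expected value.

    Each arm's sample mean is approximated as a Gaussian and weighted by
    the probability that it is the largest:
        WE = sum_i w_i * mu_i,  with  w_i = P(mu_i is the maximum).
    """
    mu = np.asarray(means, dtype=float)
    s = np.asarray(std_errs, dtype=float)
    # Shared integration grid covering all arms' Gaussians.
    x = np.linspace((mu - 6 * s).min(), (mu + 6 * s).max(), grid_size)
    dx = x[1] - x[0]
    # CDF of each arm's Gaussian evaluated on the grid (one row per arm).
    cdfs = norm.cdf((x[None, :] - mu[:, None]) / s[:, None])
    w = np.empty(mu.size)
    for i in range(mu.size):
        # w_i = integral of pdf_i(x) * prod_{j != i} cdf_j(x) dx
        others = np.prod(np.delete(cdfs, i, axis=0), axis=0)
        w[i] = np.sum(norm.pdf(x, mu[i], s[i]) * others) * dx
    w /= w.sum()
    return float(w @ mu), w
```

For identically distributed arms, ME is positively biased (the max of noisy means overestimates the true maximum), whereas WE spreads the weights and returns a value close to the common mean.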
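The GP configuration quoted in the Experiment Setup row (ARD SE kernel, hyperparameters fitted on the samples, inputs normalized to [-1, 1]) can be reproduced with off-the-shelf tools. A minimal sketch using scikit-learn — the paper does not name its GP library, so this is an assumed stand-in, not the authors' implementation:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def fit_ard_gp(X, y):
    """Fit a GP with an ARD squared-exponential kernel.

    One length scale per input dimension (anisotropic RBF = ARD SE);
    inputs are rescaled to [-1, 1] per dimension, and the length scales
    are fitted on the samples by marginal-likelihood maximization
    (scikit-learn's default behaviour in fit()).
    """
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    X_scaled = 2.0 * (X - lo) / (hi - lo) - 1.0
    kernel = RBF(length_scale=np.ones(X.shape[1]))  # per-dimension scales
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X_scaled, y)
    return gp, (lo, hi)
```

The returned bounds are needed to apply the same [-1, 1] rescaling to query points before calling `gp.predict`.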