Estimating Maximum Expected Value through Gaussian Approximation

Authors: Carlo D’Eramo, Marcello Restelli, Alessandro Nuara

ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We compare the proposed estimator with the other state-of-the-art methods both theoretically, by deriving upper bounds to the bias and the variance of the estimator, and empirically, by testing the performance on different sequential learning problems." "In this section we empirically compare the performance of WE, ME, and DE on four sequential decision-making problems: two multi-armed bandit domains and two MDPs."
Researcher Affiliation | Academia | Carlo D'Eramo (CARLO.DERAMO@POLIMI.IT), Alessandro Nuara (ALESSANDRO.NUARA@MAIL.POLIMI.IT), Marcello Restelli (MARCELLO.RESTELLI@POLIMI.IT), Politecnico di Milano, Piazza Leonardo da Vinci, 32, 20133 Milano
Pseudocode | Yes | Algorithm 1: Weighted Q-Learning (a sketch of the underlying weighted estimator is given after the table)
Open Source Code | No | The paper does not provide any links to open-source code, nor does it state that the code for the methodology is publicly available.
Open Datasets | No | The paper mentions using historical daily GBP/USD exchange-rate data from 09/22/1997 to 01/10/2005 for the Forex experiment, but it does not provide a link, DOI, repository name, or formal citation that would make this dataset publicly accessible.
Dataset Splits | No | The paper does not provide training/validation/test dataset splits with specific percentages, sample counts, or references to predefined splits that would allow reproduction.
Hardware Specification | No | The paper does not describe the hardware (e.g., CPU or GPU models, or other machine specifications) used to run the experiments.
Software Dependencies | No | The paper mentions algorithms such as Q-Learning and UCB1, but does not list specific software libraries with version numbers (e.g., Python 3.x, PyTorch x.x) that would be needed to replicate the experiments.
Experiment Setup | Yes | The learning rate is α_t(s, a) = 1/n_t(s, a)^0.8, where n_t(s, a) is the current number of updates of that action value, and the discount factor is γ = 0.95. In Double Q-Learning, two learning rates are used, α_t^A(s, a) = 1/n_t^A(s, a)^0.8 and α_t^B(s, a) = 1/n_t^B(s, a)^0.8, where n_t^A(s, a) and n_t^B(s, a) are respectively the number of times table A and table B have been updated. An ε-greedy policy is used with ε = 1/√n(s), where n(s) is the number of times state s has been visited. During the training phase, the learning rate is α(s, a) = 1/n(s, a), the discount factor is γ = 0.8, and ε = 1/√n(s). (The schedules are sketched in code below.)
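
Algorithm 1 (Weighted Q-Learning) is built on the paper's Weighted Estimator (WE), which, as we understand it, replaces the max over sample means with a weighted average whose weights are the probabilities, under a Gaussian approximation of each sample mean, that the corresponding action is the maximizer. The sketch below is a minimal, non-authoritative Python rendering of that idea; the function name `weighted_max_estimate`, the scipy-based numerical integration, and the input conventions are our assumptions, not taken from the paper.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def weighted_max_estimate(means, std_errs):
    """Sketch of a weighted estimate of max_i E[X_i].

    Assumes each sample mean is approximately N(means[i], std_errs[i]^2)
    (std_errs must be positive). The weight of action i is the probability
    that its sample mean is the largest; the estimate is the weighted
    average of the sample means."""
    means = np.asarray(means, dtype=float)
    std_errs = np.asarray(std_errs, dtype=float)
    n = len(means)
    weights = np.empty(n)
    for i in range(n):
        def integrand(x, i=i):
            # Density of the i-th sample mean at x ...
            p = stats.norm.pdf(x, loc=means[i], scale=std_errs[i])
            # ... times the probability that every other sample mean is below x.
            for j in range(n):
                if j != i:
                    p *= stats.norm.cdf(x, loc=means[j], scale=std_errs[j])
            return p
        weights[i], _ = quad(integrand, -np.inf, np.inf)
    weights /= weights.sum()  # guard against small numerical drift
    return float(np.dot(weights, means))
```

In Weighted Q-Learning, an analogous weighted average of the next-state action values would replace the max in the bootstrap target, which is how the estimator tempers the overestimation introduced by the plain max operator.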
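
The hyperparameter schedules quoted in the experiment setup translate directly into code. The following is a minimal sketch assuming a tabular Q-Learning agent with dictionary-backed tables; the function names and the data layout are our own illustration, not the paper's implementation.

```python
import numpy as np
from collections import defaultdict

def alpha(n_sa, exponent=0.8):
    """Learning-rate schedule alpha_t(s, a) = 1 / n_t(s, a)^exponent
    (the setup above also reports exponent 1.0 with gamma = 0.8 during a training phase)."""
    return 1.0 / max(n_sa, 1) ** exponent

def epsilon(n_s):
    """Exploration schedule for the eps-greedy policy: eps = 1 / sqrt(n(s))."""
    return 1.0 / np.sqrt(max(n_s, 1))

def q_learning_step(Q, counts, s, a, r, s_next, actions, gamma=0.95):
    """One tabular Q-Learning update using the schedules above.
    Q and counts are dicts keyed by (state, action)."""
    counts[(s, a)] += 1
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha(counts[(s, a)]) * (target - Q[(s, a)])

# Example usage (hypothetical two-action MDP transition):
# Q, counts = defaultdict(float), defaultdict(int)
# q_learning_step(Q, counts, s=0, a=1, r=0.3, s_next=2, actions=[0, 1])
```

For Double Q-Learning, the same alpha schedule would be applied separately to the update counts of tables A and B, as described in the setup.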