Estimating Maximum Expected Value through Gaussian Approximation

Authors: Carlo D’Eramo, Marcello Restelli, Alessandro Nuara

ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We compare the proposed estimator with the other state-of-the-art methods both theoretically, by deriving upper bounds to the bias and the variance of the estimator, and empirically, by testing the performance on different sequential learning problems." "In this section we empirically compare the performance of WE, ME, and DE on four sequential decision-making problems: two multi-armed bandit domains and two MDPs."
Researcher Affiliation | Academia | Carlo D'Eramo (CARLO.DERAMO@POLIMI.IT), Alessandro Nuara (ALESSANDRO.NUARA@MAIL.POLIMI.IT), Marcello Restelli (MARCELLO.RESTELLI@POLIMI.IT), Politecnico di Milano, Piazza Leonardo da Vinci, 32, 20133 Milano
Pseudocode | Yes | Algorithm 1: Weighted Q-Learning (a sketch of the underlying weighted estimator is given after the table)
Open Source Code | No | The paper does not provide any links to open-source code, nor does it state that the code for the methodology is publicly available.
Open Datasets | No | The paper mentions using historical daily GBP/USD exchange-rate data from 09/22/1997 to 01/10/2005 for the Forex experiment, but it does not provide a link, DOI, repository name, or formal citation that would make this dataset publicly accessible.
Dataset Splits | No | The paper does not provide training/validation/test dataset splits with specific percentages, sample counts, or references to predefined splits that would allow reproduction.
Hardware Specification | No | The paper does not describe the hardware (e.g., CPU or GPU models, or other machine specifications) used to run the experiments.
Software Dependencies | No | The paper mentions algorithms such as Q-Learning and UCB1, but does not list specific software libraries with version numbers (e.g., Python 3.x, PyTorch x.x) that would be needed to replicate the experiments.
Experiment Setup | Yes | The learning rate is α_t(s, a) = 1/n_t(s, a)^0.8, where n_t(s, a) is the current number of updates of that action value, and the discount factor is γ = 0.95. In Double Q-Learning, two learning rates are used, α_t^A(s, a) = 1/n_t^A(s, a)^0.8 and α_t^B(s, a) = 1/n_t^B(s, a)^0.8, where n_t^A(s, a) and n_t^B(s, a) are respectively the number of times table A and table B have been updated. An ε-greedy policy is used with ε = 1/√n(s), where n(s) is the number of times state s has been visited. During the training phase, the learning rate is α(s, a) = 1/n(s, a), the discount factor is γ = 0.8, and ε = 1/√n(s). (The schedules are sketched in code below.)
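
Algorithm 1 (Weighted Q-Learning) is built on the paper's Weighted Estimator (WE), which, as we understand it, replaces the max over sample means with a weighted average whose weights are the probabilities, under a Gaussian approximation of each sample mean, that the corresponding action is the maximizer. The sketch below is a minimal, non-authoritative Python rendering of that idea; the function name `weighted_max_estimate`, the scipy-based numerical integration, and the input conventions are our assumptions, not taken from the paper.

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

def weighted_max_estimate(means, std_errs):
    """Sketch of a weighted estimate of max_i E[X_i].

    Assumes each sample mean is approximately N(means[i], std_errs[i]^2)
    (std_errs must be positive). The weight of action i is the probability
    that its sample mean is the largest; the estimate is the weighted
    average of the sample means."""
    means = np.asarray(means, dtype=float)
    std_errs = np.asarray(std_errs, dtype=float)
    n = len(means)
    weights = np.empty(n)
    for i in range(n):
        def integrand(x, i=i):
            # Density of the i-th sample mean at x ...
            p = stats.norm.pdf(x, loc=means[i], scale=std_errs[i])
            # ... times the probability that every other sample mean is below x.
            for j in range(n):
                if j != i:
                    p *= stats.norm.cdf(x, loc=means[j], scale=std_errs[j])
            return p
        weights[i], _ = quad(integrand, -np.inf, np.inf)
    weights /= weights.sum()  # guard against small numerical drift
    return float(np.dot(weights, means))
```

In Weighted Q-Learning, an analogous weighted average of the next-state action values would replace the max in the bootstrap target, which is how the estimator tempers the overestimation introduced by the plain max operator.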
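
The hyperparameter schedules quoted in the experiment setup translate directly into code. The following is a minimal sketch assuming a tabular Q-Learning agent with dictionary-backed tables; the function names and the data layout are our own illustration, not the paper's implementation.

```python
import numpy as np
from collections import defaultdict

def alpha(n_sa, exponent=0.8):
    """Learning-rate schedule alpha_t(s, a) = 1 / n_t(s, a)^exponent
    (the setup above also reports exponent 1.0 with gamma = 0.8 during a training phase)."""
    return 1.0 / max(n_sa, 1) ** exponent

def epsilon(n_s):
    """Exploration schedule for the eps-greedy policy: eps = 1 / sqrt(n(s))."""
    return 1.0 / np.sqrt(max(n_s, 1))

def q_learning_step(Q, counts, s, a, r, s_next, actions, gamma=0.95):
    """One tabular Q-Learning update using the schedules above.
    Q and counts are dicts keyed by (state, action)."""
    counts[(s, a)] += 1
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha(counts[(s, a)]) * (target - Q[(s, a)])

# Example usage (hypothetical two-action MDP transition):
# Q, counts = defaultdict(float), defaultdict(int)
# q_learning_step(Q, counts, s=0, a=1, r=0.3, s_next=2, actions=[0, 1])
```

For Double Q-Learning, the same alpha schedule would be applied separately to the update counts of tables A and B, as described in the setup.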