Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Learning the Variance of the Reward-To-Go

Authors: Aviv Tamar, Dotan Di Castro, Shie Mannor

JMLR 2016 | Venue PDF | LLM Run Details

Each entry below lists the reproducibility variable, the assigned result, and the supporting LLM response.
Research Type: Experimental
LLM response: "In this paper we extend temporal difference (TD) learning algorithms to estimating the variance of the reward-to-go for a fixed policy. We propose variants of both TD(0) and LSTD(λ) with linear function approximation, prove their convergence, and demonstrate their utility in an option pricing problem. Our results show a dramatic improvement in terms of sample efficiency over standard Monte-Carlo methods, which are currently the state-of-the-art. ... An empirical evaluation of our approach on an American-style option pricing problem demonstrates a dramatic improvement in terms of sample efficiency compared to Monte Carlo techniques, the current state of the art. ... In this section we present numerical simulations of policy evaluation for an option pricing domain. We show that in terms of sample efficiency, our LSTD(λ) algorithm significantly outperforms the current state-of-the-art."
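The coupled TD(0) updates the paper describes can be sketched in tabular form: learn the mean J and the second moment M of the reward-to-go, and report the variance as V = M - J^2. This is an illustrative reconstruction, not the authors' MATLAB code; the `transitions` interface, step size, and episode count below are assumptions.

```python
import random

def td0_variance(transitions, n_states, alpha=0.01, episodes=10000, seed=0):
    """Tabular sketch of coupled TD(0) updates for the mean J and second
    moment M of the reward-to-go; the variance estimate is V = M - J^2.
    `transitions(s, rng)` returns (reward, next_state), with next_state
    None at termination. Hypothetical interface for illustration only."""
    rng = random.Random(seed)
    J = [0.0] * n_states
    M = [0.0] * n_states
    for _ in range(episodes):
        s = 0  # fixed initial state for this sketch
        while s is not None:
            r, s_next = transitions(s, rng)
            j_next = J[s_next] if s_next is not None else 0.0
            m_next = M[s_next] if s_next is not None else 0.0
            # TD(0) targets: J uses r + J(s'); M uses r^2 + 2 r J(s') + M(s').
            J[s] += alpha * (r + j_next - J[s])
            M[s] += alpha * (r * r + 2.0 * r * j_next + m_next - M[s])
            s = s_next
    variance = [m - j * j for j, m in zip(J, M)]
    return J, variance

# Toy check: a single state paying +1 or -1 uniformly, then terminating;
# the reward-to-go then has mean 0 and variance 1.
def coin_flip(s, rng):
    return (1.0 if rng.random() < 0.5 else -1.0), None

J, V = td0_variance(coin_flip, n_states=1)
```

The M update follows the second-moment Bellman equation used in the paper's TD variants; with linear function approximation the same targets drive the LSTD(λ) regression instead of a tabular update.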
Researcher Affiliation: Collaboration
- Aviv Tamar, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94709, USA
- Dotan Di Castro, Yahoo! Research Labs, MATAM, Haifa 31905, Israel
- Shie Mannor, Department of Electrical Engineering, The Technion, Israel Institute of Technology, Haifa 32000, Israel
Pseudocode: No
LLM response: The paper describes its algorithms (TD(0), LSTD) in Section 4, "Simulation Based Estimation Algorithms", using mathematical equations and prose, but it does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks with structured, code-like steps.
Open Source Code: Yes
LLM response: "MATLAB code for reproducing these results is available on the web-site https://sites.google.com/site/variancetdcode/."
Open Datasets: No
LLM response: The paper focuses on an "American-style put option" problem modeled as an MDP. The experimental data are generated by simulating a "Bernoulli price fluctuation model (Cox et al., 1979)" rather than drawn from a pre-existing, publicly available dataset, and no access information for an open dataset is provided.
Dataset Splits: No
LLM response: The paper does not use a pre-existing dataset split into training, validation, and test sets. Instead, it simulates trajectories of the MDP for its experiments, as stated in Section 4.1 ("We simulate N trajectories of the MDP with the policy π and initial state distribution ζ0.") and Section 6.2 ("The sample trajectories were simulated independently, starting from uniformly distributed initial states.").
Hardware Specification: No
LLM response: The paper provides no details about the hardware (e.g., CPU or GPU model, memory) used to run the experiments or simulations; the "Experiments" section, like the rest of the paper, focuses solely on the algorithms and their performance.
Software Dependencies: No
LLM response: The paper notes that "MATLAB code for reproducing these results is available on the web-site...", but it does not specify the MATLAB version or any other software dependencies with version numbers, which reproducibility would require.
Experiment Setup: Yes
LLM response: "An American-style put option (Hull, 2006) is a contract which gives the owner the right, but not the obligation, to sell an asset at a specified strike price K on or before some maturity time T. Letting x_t denote the price (state) of the asset at time t, the immediate payoff of executing the option at that time is therefore max(0, K - x_t). ... In our experiments we used a Bernoulli price fluctuation model (Cox et al., 1979): x_{t+1} = f_u x_t w.p. p, and x_{t+1} = f_d x_t w.p. 1 - p, where the up and down factors, f_u and f_d, are constant. ... We focus on in-the-money options, in which K is equal to the initial price x_0, and set T = 20. A policy π was obtained using the LSPI algorithm (Lagoudakis and Parr, 2003; Li et al., 2009) with 2-dimensional (for x and t) radial basis function (RBF) features, as detailed in Tamar et al. (2014). ... We tested two popular feature sets: RBF features with 77 equally spaced centers, and tile features with 600 uniform non-overlapping tiles. ... The left plot was obtained by LSTD(λ) with RBF features, using 2000 trajectories and λ = 0.3."
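The quoted setup, a Bernoulli (binomial) price model with a put payoff max(0, K - x_t), can be sketched as below. The factors f_u, f_d and probability p are illustrative placeholders, since the quoted text does not give their values; only K = x_0 and T = 20 come from the paper.

```python
import random

def simulate_price_path(x0, f_u, f_d, p, T, rng):
    """One trajectory of the Bernoulli price fluctuation model
    (Cox et al., 1979): the price is multiplied by f_u with
    probability p and by f_d otherwise, for T steps."""
    path = [x0]
    x = x0
    for _ in range(T):
        x = (f_u if rng.random() < p else f_d) * x
        path.append(x)
    return path

def put_payoff(K, x):
    """Immediate payoff of exercising an American-style put at price x."""
    return max(0.0, K - x)

# In-the-money setup from the paper: strike K equals the initial price x0,
# maturity T = 20. f_u, f_d, and p below are assumed values.
rng = random.Random(0)
x0 = K = 1.0
path = simulate_price_path(x0, f_u=1.02, f_d=0.98, p=0.5, T=20, rng=rng)
payoffs = [put_payoff(K, x) for x in path]
```

Simulating N such trajectories under the fixed policy π yields the sample set on which the paper's LSTD(λ) estimator and the Monte-Carlo baseline are compared.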