Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Learning the Variance of the Reward-To-Go

Authors: Aviv Tamar, Dotan Di Castro, Shie Mannor

JMLR 2016 | Venue PDF | LLM Run Details

Each entry below lists the reproducibility variable, the assigned result, and the supporting LLM response.
Research Type: Experimental
LLM response: "In this paper we extend temporal difference (TD) learning algorithms to estimating the variance of the reward-to-go for a fixed policy. We propose variants of both TD(0) and LSTD(λ) with linear function approximation, prove their convergence, and demonstrate their utility in an option pricing problem. Our results show a dramatic improvement in terms of sample efficiency over standard Monte-Carlo methods, which are currently the state-of-the-art. ... An empirical evaluation of our approach on an American-style option pricing problem demonstrates a dramatic improvement in terms of sample efficiency compared to Monte Carlo techniques, the current state of the art. ... In this section we present numerical simulations of policy evaluation for an option pricing domain. We show that in terms of sample efficiency, our LSTD(λ) algorithm significantly outperforms the current state-of-the-art."
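The coupled TD(0) updates the paper describes can be sketched in tabular form: learn the mean J and the second moment M of the reward-to-go, and report the variance as V = M - J^2. This is an illustrative reconstruction, not the authors' MATLAB code; the `transitions` interface, step size, and episode count below are assumptions.

```python
import random

def td0_variance(transitions, n_states, alpha=0.01, episodes=10000, seed=0):
    """Tabular sketch of coupled TD(0) updates for the mean J and second
    moment M of the reward-to-go; the variance estimate is V = M - J^2.
    `transitions(s, rng)` returns (reward, next_state), with next_state
    None at termination. Hypothetical interface for illustration only."""
    rng = random.Random(seed)
    J = [0.0] * n_states
    M = [0.0] * n_states
    for _ in range(episodes):
        s = 0  # fixed initial state for this sketch
        while s is not None:
            r, s_next = transitions(s, rng)
            j_next = J[s_next] if s_next is not None else 0.0
            m_next = M[s_next] if s_next is not None else 0.0
            # TD(0) targets: J uses r + J(s'); M uses r^2 + 2 r J(s') + M(s').
            J[s] += alpha * (r + j_next - J[s])
            M[s] += alpha * (r * r + 2.0 * r * j_next + m_next - M[s])
            s = s_next
    variance = [m - j * j for j, m in zip(J, M)]
    return J, variance

# Toy check: a single state paying +1 or -1 uniformly, then terminating;
# the reward-to-go then has mean 0 and variance 1.
def coin_flip(s, rng):
    return (1.0 if rng.random() < 0.5 else -1.0), None

J, V = td0_variance(coin_flip, n_states=1)
```

The M update follows the second-moment Bellman equation used in the paper's TD variants; with linear function approximation the same targets drive the LSTD(λ) regression instead of a tabular update.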
Researcher Affiliation: Collaboration
- Aviv Tamar, Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, CA 94709, USA
- Dotan Di Castro, Yahoo! Research Labs, MATAM, Haifa 31905, Israel
- Shie Mannor, Department of Electrical Engineering, The Technion, Israel Institute of Technology, Haifa 32000, Israel
Pseudocode: No
LLM response: The paper describes its algorithms (TD(0), LSTD) in Section 4, "Simulation Based Estimation Algorithms", using mathematical equations and prose, but it does not include any explicitly labeled "Pseudocode" or "Algorithm" blocks with structured, code-like steps.
Open Source Code: Yes
LLM response: "MATLAB code for reproducing these results is available on the web-site https://sites.google.com/site/variancetdcode/."
Open Datasets: No
LLM response: The paper focuses on an "American-style put option" problem modeled as an MDP. The experimental data are generated by simulating a "Bernoulli price fluctuation model (Cox et al., 1979)" rather than drawn from a pre-existing, publicly available dataset, and no access information for an open dataset is provided.
Dataset Splits: No
LLM response: The paper does not use a pre-existing dataset split into training, validation, and test sets. Instead, it simulates trajectories of the MDP for its experiments, as stated in Section 4.1 ("We simulate N trajectories of the MDP with the policy π and initial state distribution ζ0.") and Section 6.2 ("The sample trajectories were simulated independently, starting from uniformly distributed initial states.").
Hardware Specification: No
LLM response: The paper provides no details about the hardware (e.g., CPU or GPU model, memory) used to run the experiments or simulations; the "Experiments" section, like the rest of the paper, focuses solely on the algorithms and their performance.
Software Dependencies: No
LLM response: The paper notes that "MATLAB code for reproducing these results is available on the web-site...", but it does not specify the MATLAB version or any other software dependencies with version numbers, which reproducibility would require.
Experiment Setup: Yes
LLM response: "An American-style put option (Hull, 2006) is a contract which gives the owner the right, but not the obligation, to sell an asset at a specified strike price K on or before some maturity time T. Letting x_t denote the price (state) of the asset at time t, the immediate payoff of executing the option at that time is therefore max(0, K - x_t). ... In our experiments we used a Bernoulli price fluctuation model (Cox et al., 1979): x_{t+1} = f_u x_t w.p. p, and x_{t+1} = f_d x_t w.p. 1 - p, where the up and down factors, f_u and f_d, are constant. ... We focus on in-the-money options, in which K is equal to the initial price x_0, and set T = 20. A policy π was obtained using the LSPI algorithm (Lagoudakis and Parr, 2003; Li et al., 2009) with 2-dimensional (for x and t) radial basis function (RBF) features, as detailed in Tamar et al. (2014). ... We tested two popular feature sets: RBF features with 77 equally spaced centers, and tile features with 600 uniform non-overlapping tiles. ... The left plot was obtained by LSTD(λ) with RBF features, using 2000 trajectories and λ = 0.3."
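The quoted setup, a Bernoulli (binomial) price model with a put payoff max(0, K - x_t), can be sketched as below. The factors f_u, f_d and probability p are illustrative placeholders, since the quoted text does not give their values; only K = x_0 and T = 20 come from the paper.

```python
import random

def simulate_price_path(x0, f_u, f_d, p, T, rng):
    """One trajectory of the Bernoulli price fluctuation model
    (Cox et al., 1979): the price is multiplied by f_u with
    probability p and by f_d otherwise, for T steps."""
    path = [x0]
    x = x0
    for _ in range(T):
        x = (f_u if rng.random() < p else f_d) * x
        path.append(x)
    return path

def put_payoff(K, x):
    """Immediate payoff of exercising an American-style put at price x."""
    return max(0.0, K - x)

# In-the-money setup from the paper: strike K equals the initial price x0,
# maturity T = 20. f_u, f_d, and p below are assumed values.
rng = random.Random(0)
x0 = K = 1.0
path = simulate_price_path(x0, f_u=1.02, f_d=0.98, p=0.5, T=20, rng=rng)
payoffs = [put_payoff(K, x) for x in path]
```

Simulating N such trajectories under the fixed policy π yields the sample set on which the paper's LSTD(λ) estimator and the Monte-Carlo baseline are compared.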