Cumulative Prospect Theory Meets Reinforcement Learning: Prediction and Control

Authors: Prashanth L.A., Cheng Jie, Michael Fu, Steve Marcus, Csaba Szepesvari

ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide theoretical convergence guarantees for all the proposed algorithms and also illustrate the usefulness of CPT-based criteria in a traffic signal control application. Figures 4(b)–4(d) present the histogram of the CPT-values from the test phase for AVG-SPSA, EUT-SPSA and CPT-SPSA, respectively.
Researcher Affiliation | Academia | Prashanth L.A. (PRASHLA@ISR.UMD.EDU), Institute for Systems Research, University of Maryland; Cheng Jie (CJIE@MATH.UMD.EDU), Department of Mathematics, University of Maryland; Michael Fu (MFU@ISR.UMD.EDU), Robert H. Smith School of Business & Institute for Systems Research, University of Maryland; Steve Marcus (MARCUS@UMD.EDU), Department of Electrical and Computer Engineering & Institute for Systems Research, University of Maryland; Csaba Szepesvári (SZEPESVA@CS.UALBERTA.CA), Department of Computing Science, University of Alberta
Pseudocode | Yes | Algorithm 1: CPT-value estimation for Hölder continuous weights. Algorithm 2: Structure of CPT-SPSA-G algorithm.
Open Source Code | Yes | The experiments are performed using the GLD traffic simulator (Wiering et al., 2004) and the implementation is available at https://bitbucket.org/prashla/rl-gld.
Open Datasets | No | The paper mentions using the GLD traffic simulator for generating data, but it does not specify a publicly available dataset with concrete access information (link, DOI, repository, or citation).
Dataset Splits | No | The paper mentions a "training phase" and a "test phase" but does not explicitly describe a validation set or any specific dataset split percentages or counts for training, validation, and test sets. It states: "a training phase where we run each algorithm for 200 iterations, with each iteration involving two perturbed simulations, each of trajectory length 500. This is followed by a test phase where we fix the policy for each algorithm and 100 independent simulations of the MDP (each with a trajectory length of 1000) are performed."
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, memory, or cloud instance types) used for running its experiments.
Software Dependencies | No | The paper mentions the "GLD traffic simulator (Wiering et al., 2004)" but does not provide specific ancillary software details with version numbers (e.g., libraries, frameworks, or programming language versions).
Experiment Setup | Yes | For both CPT-SPSA and EUT-SPSA, we set the utility functions (see (1)) as follows: $u^+(x) = |x|^{\sigma}$, and $u^-(x) = \lambda |x|^{\sigma}$, where $\lambda = 2.25$ and $\sigma = 0.88$. For CPT-SPSA, we set the weights as follows: $w^+(p) = \frac{p^{\eta_1}}{(p^{\eta_1} + (1-p)^{\eta_1})^{1/\eta_1}}$, $w^-(p) = \frac{p^{\eta_2}}{(p^{\eta_2} + (1-p)^{\eta_2})^{1/\eta_2}}$, where $\eta_1 = 0.61$ and $\eta_2 = 0.69$. For all the algorithms, motivated by standard guidelines (see Spall 2005), we set $\delta_n = 1.9/n^{0.101}$ and $a_n = 1/(n + 50)$. The initial point $\theta_0$ is the $d$-dimensional vector of ones and $\forall i$, the operator $\Gamma_i$ projects $\theta_i$ onto the set [0.1, 10.0]. The experiments involve two phases: first, a training phase where we run each algorithm for 200 iterations, with each iteration involving two perturbed simulations, each of trajectory length 500.
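
The quantities quoted in the Experiment Setup row can be written out as a minimal Python sketch. The constants come from the row above; the function names (`u_plus`, `w_plus`, `perturbation_size`, `step_size`, `project`) are illustrative assumptions and are not taken from the paper or from the rl-gld repository. The sketch covers only the parameter definitions, not the CPT-SPSA algorithm itself.

```python
# Hypothetical helper names; only the numeric constants come from the quoted setup.
LAMBDA, SIGMA = 2.25, 0.88        # utility parameters (lambda, sigma)
ETA_PLUS, ETA_MINUS = 0.61, 0.69  # weight parameters (eta_1, eta_2)


def u_plus(x):
    """Utility applied to gains: |x|^sigma."""
    return abs(x) ** SIGMA


def u_minus(x):
    """Utility applied to losses: lambda * |x|^sigma."""
    return LAMBDA * abs(x) ** SIGMA


def _weight(p, eta):
    """Probability weight p^eta / (p^eta + (1 - p)^eta)^(1/eta), for p in [0, 1]."""
    return p ** eta / (p ** eta + (1.0 - p) ** eta) ** (1.0 / eta)


def w_plus(p):
    return _weight(p, ETA_PLUS)


def w_minus(p):
    return _weight(p, ETA_MINUS)


def perturbation_size(n):
    """delta_n = 1.9 / n^0.101, for iteration n >= 1."""
    return 1.9 / n ** 0.101


def step_size(n):
    """a_n = 1 / (n + 50)."""
    return 1.0 / (n + 50)


def project(theta, low=0.1, high=10.0):
    """Coordinate-wise projection of theta onto [low, high]."""
    return [min(max(t, low), high) for t in theta]


if __name__ == "__main__":
    d = 4                 # assumed dimension, for illustration only
    theta0 = [1.0] * d    # initial point: d-dimensional vector of ones
    print(project(theta0), w_plus(0.5), w_minus(0.5))
```

Both weight functions map 0 to 0 and 1 to 1, and with these parameter values they give a probability of 0.5 a weight below 0.5.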