Algorithms for CVaR Optimization in MDPs

Authors: Yinlam Chow, Mohammad Ghavamzadeh

NeurIPS 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we demonstrate the usefulness of our algorithms in an optimal stopping problem. We compare the performance of our risk-sensitive policy gradient Algorithm 1 (PG-CVaR) and two actor-critic Algorithms 2 (AC-CVaR-SPSA, AC-CVaR-Semi-Traj) with their risk-neutral counterparts (PG and AC) (see Appendix C of [13] for the details of these experiments). Figure 1 shows the distribution of the discounted cumulative cost Dθ(x0) for the policy θ learned by each of these algorithms. The results indicate that the risk-sensitive algorithms yield a higher expected loss, but less variance, compared to the risk-neutral methods. More precisely, the loss distributions of the risk-sensitive algorithms have a lower right-tail than their risk-neutral counterparts. Table 1 summarizes the performance of these algorithms. (An illustrative VaR/CVaR estimation sketch follows the table.)
Researcher Affiliation | Collaboration | Yinlam Chow: Institute of Computational & Mathematical Engineering, Stanford University. Mohammad Ghavamzadeh: Adobe Research & INRIA Lille, Team SequeL. Part of the work was completed during Yinlam Chow's internship at Adobe Research. Mohammad Ghavamzadeh is at Adobe Research, on leave of absence from INRIA Lille, Team SequeL.
Pseudocode | Yes | Algorithm 1: Trajectory-based Policy Gradient Algorithm for CVaR Optimization. (A hedged sketch of the corresponding update follows the table.)
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | No | We consider an optimal stopping problem in which the state at each time step k ≤ T consists of the cost c_k and time k, i.e., x = (c_k, k), where T is the stopping time. ... The problem has been described in more detail in Appendix C of [13]. (A generic simulator sketch follows the table.)
Dataset Splits | No | The paper describes a simulated optimal stopping problem but does not specify any dataset splits (training, validation, or testing) or refer to standard benchmark splits.
Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment.
Experiment Setup | No | The step-size schedules satisfy the standard conditions for stochastic approximation algorithms, and ensure that the VaR parameter ν update is on the fastest time-scale {ζ3(i)}, the policy parameter θ update is on the intermediate time-scale {ζ2(i)}, and the Lagrange multiplier λ update is on the slowest time-scale {ζ1(i)} (see Appendix A.1 of [13] for the conditions on the step-size schedules). (Example schedules are sketched below.)
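
The Open Datasets row quotes only the state definition x = (c_k, k) and defers the rest of the optimal stopping problem to Appendix C of [13]. The simulator below is therefore only a generic sketch of an environment with that state structure; the cost dynamics (f_up, f_down, p_up, hold_cost), the horizon T, and the discount factor gamma are placeholder assumptions, not the values used in the paper.

```python
import numpy as np

def simulate_stopping_episode(policy, x0=1.0, T=20, f_up=2.0, f_down=0.5,
                              p_up=0.65, hold_cost=0.1, gamma=0.95, rng=None):
    """Roll out one episode of a generic optimal stopping problem.

    The state is x = (c_k, k) as in the quoted excerpt; the cost dynamics
    (f_up, f_down, p_up, hold_cost) and the discount gamma are illustrative
    placeholders, not the values from Appendix C of [13].
    policy(c, k) -> True to stop and accept cost c, False to continue.
    Returns the discounted cumulative cost D of the episode.
    """
    rng = rng or np.random.default_rng()
    c, total = x0, 0.0
    for k in range(T):
        if policy(c, k):                  # stop: pay the current cost and terminate
            return total + gamma ** k * c
        total += gamma ** k * hold_cost   # continue: pay a per-step holding cost
        c *= f_up if rng.random() < p_up else f_down
    return total + gamma ** T * c         # forced stop at the horizon T

# Example: sample the loss distribution of a naive threshold policy.
losses = [simulate_stopping_episode(lambda c, k: c <= 0.8) for _ in range(1000)]
```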
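To make the Pseudocode row's Algorithm 1 concrete, here is a minimal, single-trajectory sketch of a three-timescale update for CVaR-constrained policy gradient in the spirit of the paper's trajectory-based algorithm, assuming the standard Rockafellar–Uryasev form of the CVaR constraint. The function name pg_cvar_step, the sample_trajectory/score interface, the constraint level beta, the constant step sizes, and the simple clipping used in place of projection operators are illustrative assumptions, not the authors' exact pseudocode.

```python
def pg_cvar_step(theta, nu, lam, sample_trajectory,
                 alpha=0.95, beta=20.0, zeta3=0.1, zeta2=0.01, zeta1=0.001):
    """One single-sample, three-timescale update for CVaR-constrained policy gradient.

    sample_trajectory(theta) must return (D, score) for one rollout, where D is the
    discounted cumulative cost and score = sum_k grad_theta log pi_theta(a_k | x_k)
    (an array of the same shape as theta). The update seeks a saddle point of
        L(nu, theta, lam) = E[D] + lam * (nu + E[(D - nu)^+] / (1 - alpha) - beta),
    descending in (nu, theta) and ascending in lam.
    """
    D, score = sample_trajectory(theta)
    excess = max(D - nu, 0.0)
    exceeds = 1.0 if D >= nu else 0.0

    # Fastest timescale: nu tracks the alpha-VaR of the loss distribution.
    nu = nu - zeta3 * lam * (1.0 - exceeds / (1.0 - alpha))
    # Intermediate timescale: likelihood-ratio (score-function) gradient step in theta.
    theta = theta - zeta2 * score * (D + lam * excess / (1.0 - alpha))
    # Slowest timescale: the multiplier ascends the CVaR-constraint violation;
    # clipping at zero stands in for the paper's projection operators.
    lam = max(0.0, lam + zeta1 * (nu + excess / (1.0 - alpha) - beta))
    return theta, nu, lam
```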
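For the Experiment Setup row: the quoted text gives only the qualitative ordering of the three timescales (ν fastest, θ intermediate, λ slowest) and refers to Appendix A.1 of [13] for the exact conditions. The schedules below are illustrative choices that satisfy the usual stochastic-approximation requirements (non-summable, square-summable, with ζ1(i) = o(ζ2(i)) and ζ2(i) = o(ζ3(i))); the exponents are assumptions, not the paper's values.

```python
def step_sizes(i):
    """Illustrative three-timescale step-size schedules for iteration i >= 1.

    Each schedule is non-summable and square-summable, and the ratios
    zeta1/zeta2 and zeta2/zeta3 vanish, giving the fastest/intermediate/slowest
    ordering described in the paper. The exponents are placeholder choices.
    """
    zeta3 = 1.0 / i ** 0.6   # fastest:      VaR parameter nu
    zeta2 = 1.0 / i ** 0.8   # intermediate: policy parameter theta
    zeta1 = 1.0 / i          # slowest:      Lagrange multiplier lambda
    return zeta1, zeta2, zeta3
```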
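Finally, the Research Type row compares loss distributions by their right tails (Figure 1 and Table 1 of the paper). A minimal way to reproduce that kind of comparison from rollouts is to estimate VaR and CVaR empirically from sampled discounted cumulative costs; the confidence level alpha = 0.95, the sample sizes, and the two synthetic loss samples below are assumptions used purely for illustration.

```python
import numpy as np

def empirical_var_cvar(losses, alpha=0.95):
    """Empirical VaR (alpha-quantile) and CVaR (mean of the losses at or
    above that quantile) of a sample of losses."""
    losses = np.asarray(losses, dtype=float)
    var = np.quantile(losses, alpha)
    cvar = losses[losses >= var].mean()
    return var, cvar

# Two synthetic loss samples standing in for D^theta(x0) under a risk-neutral
# and a risk-sensitive policy (means and variances are made up for illustration).
rng = np.random.default_rng(0)
samples = {
    "risk-neutral":   rng.normal(loc=40.0, scale=15.0, size=10_000),
    "risk-sensitive": rng.normal(loc=45.0, scale=5.0, size=10_000),
}
for name, sample in samples.items():
    var, cvar = empirical_var_cvar(sample)
    print(f"{name:15s} mean={sample.mean():6.2f} VaR95={var:6.2f} CVaR95={cvar:6.2f}")
```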