Algorithms for CVaR Optimization in MDPs
Authors: Yinlam Chow, Mohammad Ghavamzadeh
NeurIPS 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we demonstrate the usefulness of our algorithms in an optimal stopping problem. We compare the performance of our risk-sensitive policy gradient Algorithm 1 (PG-CVaR) and two actor-critic Algorithms 2 (AC-CVaR-SPSA, AC-CVaR-Semi-Traj) with their risk-neutral counterparts (PG and AC) (see Appendix C of [13] for the details of these experiments). Figure 1 shows the distribution of the discounted cumulative cost Dθ(x0) for the policy θ learned by each of these algorithms. The results indicate that the risk-sensitive algorithms yield a higher expected loss, but less variance, compared to the risk-neutral methods. More precisely, the loss distributions of the risk-sensitive algorithms have a lower right tail than their risk-neutral counterparts. Table 1 summarizes the performance of these algorithms. |
| Researcher Affiliation | Collaboration | Yinlam Chow, Institute of Computational & Mathematical Engineering, Stanford University. Mohammad Ghavamzadeh, Adobe Research & INRIA Lille, Team SequeL. Part of the work was completed during Yinlam Chow's internship at Adobe Research. Mohammad Ghavamzadeh is at Adobe Research, on leave of absence from INRIA Lille, Team SequeL. |
| Pseudocode | Yes | Algorithm 1 Trajectory-based Policy Gradient Algorithm for CVaR Optimization |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | No | We consider an optimal stopping problem in which the state at each time step k ≤ T consists of the cost ck and time k, i.e., x = (ck, k), where T is the stopping time. ... The problem has been described in more details in Appendix C of [13]. |
| Dataset Splits | No | The paper describes a simulated optimal stopping problem but does not specify any dataset splits (training, validation, or testing) or refer to standard benchmark splits. |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | No | The step-size schedules satisfy the standard conditions for stochastic approximation algorithms, and ensure that the VaR parameter ν update is on the fastest time-scale ζ3(i), the policy parameter θ update is on the intermediate time-scale ζ2(i), and the Lagrange multiplier λ update is on the slowest time-scale ζ1(i) (see Appendix A.1 of [13] for the conditions on the step-size schedules). |
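To make the multi-timescale setup concrete, the sketch below shows step-size schedules of the kind the paper's conditions allow (the exponents 1.0 / 0.8 / 0.6 are illustrative assumptions, not the authors' values), together with the fastest-timescale piece: a stochastic-approximation update of the VaR parameter ν based on the Rockafellar–Uryasev characterization CVaR_α(D) = min_ν { ν + E[(D − ν)⁺] / (1 − α) }. The standard-normal loss distribution is also an assumption chosen only so the fixed point is known in closed form.

```python
import numpy as np

# Illustrative step-size schedules. Each satisfies the standard
# stochastic-approximation conditions (sum diverges, sum of squares
# converges, exponent in (0.5, 1]), and the exponents are ordered so that
# zeta1 << zeta2 << zeta3 asymptotically: lambda is slowest, nu is fastest.
def zeta1(i):  # Lagrange multiplier lambda: slowest time-scale
    return 1.0 / i

def zeta2(i):  # policy parameter theta: intermediate time-scale
    return 1.0 / i ** 0.8

def zeta3(i):  # VaR parameter nu: fastest time-scale
    return 1.0 / i ** 0.6

def estimate_var(alpha=0.95, n_iters=200_000, seed=0):
    """Track VaR_alpha of a loss D via the subgradient of
    nu + E[(D - nu)^+] / (1 - alpha), sampled one loss at a time."""
    rng = np.random.default_rng(seed)
    nu = 0.0
    for i in range(1, n_iters + 1):
        d = rng.standard_normal()  # assumed toy loss: D ~ N(0, 1)
        # Stochastic subgradient in nu: 1 - 1{D >= nu} / (1 - alpha).
        grad = 1.0 - (d >= nu) / (1.0 - alpha)
        nu -= zeta3(i) * grad
    return nu

nu_hat = estimate_var()
# For D ~ N(0, 1) and alpha = 0.95, the fixed point is VaR_0.95 ~= 1.645.
print(f"estimated VaR: {nu_hat:.3f}")
```

In the paper's actor-critic algorithms this ν recursion runs jointly with the θ and λ recursions; because ζ3 dominates ζ2, which dominates ζ1, the ν iterate effectively sees θ and λ as quasi-static, which is what the time-scale-separation analysis relies on.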