Robust Reinforcement Learning with General Utility
Authors: Ziyi Chen, Yan Wen, Zhengmian Hu, Heng Huang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present simulation results of Algorithm 1 for convex utility. Simulation Setting. We choose $\mathcal{S} = \{1, 2, \ldots, S\}$ with $S = 10$ states and $\mathcal{A} = \{1, 2, \ldots, A\}$ with $A = 5$ actions. The discount factor is $\gamma = 0.95$ and we select the uniform distribution as the initial state distribution $\rho$. To optimize the objective function (2), we apply direct parameterization to the policy parameter $\theta_{s,a} = \pi(a \mid s)$ with $\theta \in \Theta = (\Delta_{\mathcal{A}})^{S}$ and the transition kernel parameter $\xi_{s,a,s'} = p(s' \mid s, a)$ with $\xi \in (\Delta_{\mathcal{S}})^{S \times A}$. In order to preserve $\xi(:,:,s') \in \Delta_{\mathcal{S}}$, we select the nominal kernel $\bar{\xi}(\cdot,\cdot,s')$ as $\lvert 10 + \epsilon_{s'} \rvert / \sum_{s'} \lvert 10 + \epsilon_{s'} \rvert$, where $\epsilon_{s'} \overset{\mathrm{i.i.d.}}{\sim} N(0, 1)$ for each $s' \in \mathcal{S}$. Then we select a sufficiently small radius $r = 0.01 < \min_{s,a,s'} \bar{\xi}_{s,a,s'}$ and use the $\ell_2$ ambiguity set $\Xi := \{\xi : \lVert \xi(s,:,:) - \bar{\xi}(s,:,:) \rVert \le r\}$ (for the transition kernel) such that all $\xi \in \Xi$ have all positive entries. As for the general utility function $f$, we use the following convex entropy function with application to exploration (Example 2.2 of [54]): $\min_{\theta \in \Theta} \max_{\xi \in \Xi} f(\lambda_{\theta,\xi}) := \sum_{s} \lambda_{\theta,\xi}(s) \log \lambda_{\theta,\xi}(s)$ (29), where $\lambda_{\theta,\xi}(s) := \sum_{a \in \mathcal{A}} \lambda_{\theta,\xi}(s, a)$ denotes the state visitation measure for any $s \in \mathcal{S}$, $\theta \in \Theta$ and $\xi \in \Xi$. Hyperparameters. For Algorithm 1, we use the following hyperparameters obtained from fine-tuning rather than from Theorem 2: $K = 200$, $T = 25$, $K' = 300$, $T' = 25$, $\alpha = 0.002$, $\beta = 0.001$, $a = 0.002$, $b = 0.002$, $L_{\xi,\xi} = 20$, $m_{\lambda}^{(1)} = 15$, $H_{\lambda}^{(1)} = 100$, $m_{\theta}^{(1)} = 15$, $H_{\theta}^{(1)} = 100$, $m_{\lambda}^{(2)} = 15$, $H_{\lambda}^{(2)} = 100$, $m_{\xi}^{(2)} = 15$, $H_{\xi}^{(2)} = 100$, $m_{\lambda}^{(3)} = 10$, $H_{\lambda}^{(3)} = 100$, $m_{\xi}^{(3)} = 10$, $H_{\xi}^{(3)} = 100$, $m_{\lambda}^{(4)} = 10$, $H_{\lambda}^{(4)} = 100$, $m_{\theta}^{(4)} = 10$, $H_{\theta}^{(4)} = 100$. Environment. The experiment is implemented in Python 3.8 on an AMD EPYC-7313 CPU at 3.00GHz and takes about 1.5 hours in total. Results. The numerical result of Algorithm 1 is shown in Figure 1. Here the y-axis is the norm of the true projected gradient $\sqrt{\lVert G_{b}^{(\theta)}(\theta_k, \xi_k) \rVert^2 + \lVert G_{a}^{(\xi)}(\theta_k, \xi_k) \rVert^2}$ at each outer iteration $k$ of both phases of Algorithm 1 (separated by the green vertical dashed line), and the x-axis is the sample complexity (i.e., the total number of generated samples up to iteration $k$). Figure 1 shows that the projected gradient decays and converges to a small value, which matches Theorem 2. *(Minimal code sketches reconstructing this setup appear below the table.)* |
| Researcher Affiliation | Academia | Ziyi Chen, Yan Wen, Zhengmian Hu, Heng Huang, Department of Computer Science, Institute of Health Computing, University of Maryland College Park, College Park, MD 20742, USA {zc286,ywen1,zhu123,heng}@umd.edu |
| Pseudocode | Yes | Algorithm 1 Projected Stochastic Gradient Descent Ascent Algorithm For Convex Utility |
| Open Source Code | Yes | We have uploaded our code which generates the simulation data for our experiments. |
| Open Datasets | No | The paper specifies a simulation setting with generated data, not a publicly available dataset with concrete access information (link, DOI, citation). |
| Dataset Splits | No | The paper describes simulation settings and algorithms but does not specify training/test/validation dataset splits (e.g., percentages, sample counts, or predefined splits). |
| Hardware Specification | Yes | Environment. The experiment is implemented in Python 3.8 on an AMD EPYC-7313 CPU at 3.00GHz and takes about 1.5 hours in total. |
| Software Dependencies | Yes | Environment. The experiment is implemented in Python 3.8 on an AMD EPYC-7313 CPU at 3.00GHz and takes about 1.5 hours in total. |
| Experiment Setup | Yes | Hyperparameters. For Algorithm 1, we use the following hyperparameters obtained from fine-tuning rather than from Theorem 2: $K = 200$, $T = 25$, $K' = 300$, $T' = 25$, $\alpha = 0.002$, $\beta = 0.001$, $a = 0.002$, $b = 0.002$, $L_{\xi,\xi} = 20$, $m_{\lambda}^{(1)} = 15$, $H_{\lambda}^{(1)} = 100$, $m_{\theta}^{(1)} = 15$, $H_{\theta}^{(1)} = 100$, $m_{\lambda}^{(2)} = 15$, $H_{\lambda}^{(2)} = 100$, $m_{\xi}^{(2)} = 15$, $H_{\xi}^{(2)} = 100$, $m_{\lambda}^{(3)} = 10$, $H_{\lambda}^{(3)} = 100$, $m_{\xi}^{(3)} = 10$, $H_{\xi}^{(3)} = 100$, $m_{\lambda}^{(4)} = 10$, $H_{\lambda}^{(4)} = 100$, $m_{\theta}^{(4)} = 10$, $H_{\theta}^{(4)} = 100$. |
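
The Simulation Setting quoted in the Research Type row pins down the nominal kernel and ambiguity-set construction precisely enough to sketch in code. The snippet below is a minimal reconstruction, not the authors' released code: the seed is arbitrary, `in_ambiguity_set` is a hypothetical helper, and we read the quote as drawing one noise term per destination state $s'$, shared across all $(s, a)$ pairs.

```python
import numpy as np

S, A = 10, 5                      # states and actions from the quoted setting
rng = np.random.default_rng(0)    # arbitrary seed; the paper does not specify one

# Nominal kernel: weight |10 + eps_{s'}| per destination state s', normalized
# over s'. Read literally, the quote shares these weights across every (s, a).
eps = rng.standard_normal(S)
w = np.abs(10.0 + eps)
xi_bar = np.tile(w / w.sum(), (S, A, 1))   # shape (S, A, S); each row sums to 1

# Radius of the L2 ambiguity set. The quote requires
# r = 0.01 < min_{s,a,s'} xi_bar[s,a,s'], keeping every kernel in the set positive.
r = 0.01
assert r < xi_bar.min()

def in_ambiguity_set(xi, xi_bar, r):
    """Hypothetical check of the quoted constraint ||xi(s,:,:) - xi_bar(s,:,:)|| <= r."""
    return all(np.linalg.norm(xi[s] - xi_bar[s]) <= r + 1e-12 for s in range(xi.shape[0]))
```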
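
The entropy objective in Eq. (29) needs the state visitation measure $\lambda_{\theta,\xi}(s)$. For a tabular MDP this has a closed form via a linear system; the sketch below assumes the normalized convention with a $(1-\gamma)$ factor (so that $\lambda$ is a probability distribution), which the quote does not state explicitly.

```python
import numpy as np

def state_visitation(pi, xi, rho, gamma=0.95):
    """State visitation measure lambda_{theta,xi}(s) for policy pi under kernel xi.

    pi:  (S, A) array, pi[s, a] = pi(a|s)   (direct parameterization theta)
    xi:  (S, A, S) array, xi[s, a, s'] = p(s'|s, a)
    rho: (S,) initial state distribution

    Solves d = (1 - gamma) * rho + gamma * P_pi^T d, where
    P_pi[s, s'] = sum_a pi(a|s) * xi[s, a, s'].
    """
    S = rho.shape[0]
    P_pi = np.einsum("sa,sap->sp", pi, xi)
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1.0 - gamma) * rho)

def entropy_utility(lam_s, eps=1e-12):
    """Convex entropy utility f(lambda) = sum_s lambda(s) log lambda(s) from Eq. (29)."""
    return float(np.sum(lam_s * np.log(lam_s + eps)))

# Example: uniform policy, uniform initial distribution, placeholder uniform kernel.
S, A = 10, 5
pi = np.full((S, A), 1.0 / A)
rho = np.full(S, 1.0 / S)
xi = np.full((S, A, S), 1.0 / S)   # xi_bar from the previous sketch could be used instead
lam = state_visitation(pi, xi, rho)
print(entropy_utility(lam))        # -log(S) ~ -2.303: uniform occupancy minimizes f
```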
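
Figure 1's y-axis metric $\sqrt{\lVert G_{b}^{(\theta)} \rVert^2 + \lVert G_{a}^{(\xi)} \rVert^2}$ is a projected-gradient norm evaluated with the quoted step sizes $b = 0.002$ and $a = 0.002$. The sketch below assumes the standard gradient-mapping definition $G_{\eta}(x) = (x - \mathrm{Proj}(x - \eta g))/\eta$ and, for brevity, projects the kernel rows only onto the simplex, ignoring the $\ell_2$-ball part of $\Xi$; both simplifications are ours, not the paper's.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (Duchi et al., 2008)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    j = np.nonzero(u - (css - 1.0) / k > 0)[0][-1] + 1
    tau = (css[j - 1] - 1.0) / j
    return np.maximum(v - tau, 0.0)

def projected_grad_norm(theta, g_theta, xi, g_xi, b=0.002, a=0.002):
    """Sketch of the Figure 1 metric sqrt(||G_b^(theta)||^2 + ||G_a^(xi)||^2)."""
    # Descent step on theta, projected back onto (Delta_A)^S row by row.
    theta_new = np.stack([project_simplex(row) for row in theta - b * g_theta])
    G_theta = (theta - theta_new) / b
    # Ascent step on xi; only the simplex constraint is enforced here.
    xi_step = (xi + a * g_xi).reshape(-1, xi.shape[-1])
    xi_new = np.stack([project_simplex(row) for row in xi_step])
    G_xi = (xi.reshape(-1, xi.shape[-1]) - xi_new) / a
    return float(np.sqrt(np.linalg.norm(G_theta) ** 2 + np.linalg.norm(G_xi) ** 2))
```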