Robust Reinforcement Learning with General Utility

Authors: Ziyi Chen, Yan Wen, Zhengmian Hu, Heng Huang

NeurIPS 2024

Reproducibility Variable: Research Type
Result: Experimental
LLM Response: In this section, we present simulation results of Algorithm 1 for convex utility.

Simulation setting. We choose $\mathcal{S} = \{1, 2, \ldots, S\}$ with $S = 10$ states and $\mathcal{A} = \{1, 2, \ldots, A\}$ with $A = 5$ actions. The discount factor is $\gamma = 0.95$, and we select the uniform distribution as the initial state distribution $\rho$. To optimize the objective function (2), we apply direct parameterization to the policy parameter $\theta_{s,a} = \pi(a|s)$ with $\theta \in \Theta = (\Delta_{\mathcal{A}})^{S}$ and to the transition kernel parameter $\xi_{s,a,s'} = p(s'|s,a)$ with $\xi \in (\Delta_{\mathcal{S}})^{SA}$. In order to preserve $\bar{\xi} \in (\Delta_{\mathcal{S}})^{SA}$, we select the nominal kernel $\bar{\xi}(:,:,s') = |10+\epsilon_{s'}| \big/ \sum_{s''} |10+\epsilon_{s''}|$, where $\epsilon_{s'} \stackrel{\mathrm{i.i.d.}}{\sim} N(0,1)$ for each $s' \in \mathcal{S}$. Then we select a sufficiently small radius $r = 0.01 < \min_{s,a,s'} \bar{\xi}_{s,a,s'}$ and use the $L_2$ ambiguity set $\Xi := \{\xi : \|\xi(s,:,:) - \bar{\xi}(s,:,:)\|_2 \le r \text{ for all } s\}$ for the transition kernel, so that every $\xi \in \Xi$ has all positive entries. As for the general utility function $f$, we use the following convex entropy function with application to exploration (Example 2.2 of [54]):

$$\min_{\theta \in \Theta} \max_{\xi \in \Xi} f(\lambda_{\theta,\xi}) := \sum_{s \in \mathcal{S}} \lambda_{\theta,\xi}(s) \log \lambda_{\theta,\xi}(s), \tag{29}$$

where $\lambda_{\theta,\xi}(s) := \sum_{a \in \mathcal{A}} \lambda_{\theta,\xi}(s,a)$ denotes the state visitation measure for any $s \in \mathcal{S}$, $\theta \in \Theta$, and $\xi \in \Xi$.

Hyperparameters. For Algorithm 1, we use the following hyperparameters obtained from fine-tuning rather than from Theorem 2: $K = 200$, $T = 25$, $K' = 300$, $T' = 25$, $\alpha = 0.002$, $\beta = 0.001$, $a = 0.002$, $b = 0.002$, $L_{\xi,\xi} = 20$, $m_{\lambda}^{(1)} = 15$, $H_{\lambda}^{(1)} = 100$, $m_{\theta}^{(1)} = 15$, $H_{\theta}^{(1)} = 100$, $m_{\lambda}^{(2)} = 15$, $H_{\lambda}^{(2)} = 100$, $m_{\xi}^{(2)} = 15$, $H_{\xi}^{(2)} = 100$, $m_{\lambda}^{(3)} = 10$, $H_{\lambda}^{(3)} = 100$, $m_{\xi}^{(3)} = 10$, $H_{\xi}^{(3)} = 100$, $m_{\lambda}^{(4)} = 10$, $H_{\lambda}^{(4)} = 100$, $m_{\theta}^{(4)} = 10$, $H_{\theta}^{(4)} = 100$.

Environment. The experiment is implemented in Python 3.8 on an AMD EPYC 7313 CPU at 3.00 GHz and takes about 1.5 hours in total.

Results. The numerical result of Algorithm 1 is shown in Figure 1. The y-axis is the norm of the true projected gradient $\sqrt{\|G_b^{(\theta)}(\theta_k, \xi_k)\|^2 + \|G_a^{(\xi)}(\theta_k, \xi_k)\|^2}$ at each outer iteration $k$ of both phases of Algorithm 1 (separated by the green vertical dashed line), and the x-axis is the sample complexity (i.e., the total number of samples generated up to iteration $k$). Figure 1 shows that the projected gradient decays and converges to a small value, which matches Theorem 2.
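To make the quoted setup concrete, here is a minimal Python sketch of the nominal kernel construction and the entropy utility. The names (`entropy_utility`, `xi_bar`) and the seed are ours, not the paper's, and the sketch assumes the visitation measure is given as an $S \times A$ array with positive state marginals.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed is our choice, not the paper's
S, A = 10, 5                    # states and actions from the quoted setting
gamma, r = 0.95, 0.01           # discount factor and ambiguity-set radius

# Nominal kernel: draw eps_{s'} ~ N(0, 1) i.i.d. and set
# xi_bar(s, a, s') = |10 + eps_{s'}| / sum_{s''} |10 + eps_{s''}|,
# so each slice xi_bar[s, a, :] is a distribution over next states.
eps = rng.standard_normal(S)
w = np.abs(10.0 + eps)
xi_bar = np.broadcast_to(w / w.sum(), (S, A, S)).copy()

# L2 ambiguity set: kernels xi with ||xi(s,:,:) - xi_bar(s,:,:)||_2 <= r
# for every s; r = 0.01 < min entry of xi_bar keeps all entries positive.
assert r < xi_bar.min()

def entropy_utility(lam_sa):
    """f(lambda) = sum_s lambda(s) log lambda(s), where the state visitation
    measure is lambda(s) = sum_a lambda(s, a); assumes lambda(s) > 0."""
    lam_s = lam_sa.sum(axis=1)
    return float(np.sum(lam_s * np.log(lam_s)))
```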
Reproducibility Variable: Researcher Affiliation
Result: Academia
LLM Response: Ziyi Chen, Yan Wen, Zhengmian Hu, Heng Huang; Department of Computer Science, Institute of Health Computing, University of Maryland College Park, College Park, MD 20742, USA. {zc286,ywen1,zhu123,heng}@umd.edu
Reproducibility Variable: Pseudocode
Result: Yes
LLM Response: Algorithm 1 Projected Stochastic Gradient Descent Ascent Algorithm For Convex Utility
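As an illustration of the projected descent-ascent structure (not the paper's implementation), a minimal sketch of one outer step follows, using the standard sorting-based Euclidean projection onto the probability simplex. The paper's Algorithm 1 additionally keeps $\xi$ inside the $L_2$ ambiguity set $\Xi$, which this sketch omits.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex
    (sorting-based method of Duchi et al., 2008)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(v)) + 1.0) > 0)[0][-1]
    return np.maximum(v + (1.0 - css[rho]) / (rho + 1.0), 0.0)

def projected_gda_step(theta, xi, g_theta, g_xi, beta, alpha):
    """Descent in the policy theta (rows are action distributions) and
    ascent in the kernel xi (slices xi[s, a, :] are next-state
    distributions); stepsizes beta, alpha as in the quoted setup."""
    theta_new = np.apply_along_axis(project_simplex, 1, theta - beta * g_theta)
    xi_new = np.empty_like(xi)
    for s in range(xi.shape[0]):
        for a in range(xi.shape[1]):
            xi_new[s, a] = project_simplex(xi[s, a] + alpha * g_xi[s, a])
    return theta_new, xi_new
```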
Reproducibility Variable: Open Source Code
Result: Yes
LLM Response: We have uploaded our code which generates the simulation data for our experiments.
Reproducibility Variable: Open Datasets
Result: No
LLM Response: The paper specifies a simulation setting with generated data, not a publicly available dataset with concrete access information (link, DOI, citation).
Reproducibility Variable: Dataset Splits
Result: No
LLM Response: The paper describes simulation settings and algorithms but does not specify training/test/validation dataset splits (e.g., percentages, sample counts, or predefined splits).
Reproducibility Variable: Hardware Specification
Result: Yes
LLM Response: Environment. The experiment is implemented in Python 3.8 on an AMD EPYC 7313 CPU at 3.00 GHz and takes about 1.5 hours in total.

Reproducibility Variable: Software Dependencies
Result: Yes
LLM Response: Environment. The experiment is implemented in Python 3.8 on an AMD EPYC 7313 CPU at 3.00 GHz and takes about 1.5 hours in total.
Reproducibility Variable: Experiment Setup
Result: Yes
LLM Response: Hyperparameters. For Algorithm 1, we use the following hyperparameters obtained from fine-tuning rather than from Theorem 2: $K = 200$, $T = 25$, $K' = 300$, $T' = 25$, $\alpha = 0.002$, $\beta = 0.001$, $a = 0.002$, $b = 0.002$, $L_{\xi,\xi} = 20$, $m_{\lambda}^{(1)} = 15$, $H_{\lambda}^{(1)} = 100$, $m_{\theta}^{(1)} = 15$, $H_{\theta}^{(1)} = 100$, $m_{\lambda}^{(2)} = 15$, $H_{\lambda}^{(2)} = 100$, $m_{\xi}^{(2)} = 15$, $H_{\xi}^{(2)} = 100$, $m_{\lambda}^{(3)} = 10$, $H_{\lambda}^{(3)} = 100$, $m_{\xi}^{(3)} = 10$, $H_{\xi}^{(3)} = 100$, $m_{\lambda}^{(4)} = 10$, $H_{\lambda}^{(4)} = 100$, $m_{\theta}^{(4)} = 10$, $H_{\theta}^{(4)} = 100$.
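For readability, the quoted hyperparameters can be collected in a config dict. The key names below are our invention (not the paper's code), and mapping $K, T$ to phase 1 and $K', T'$ to phase 2 is our reading of the two-phase algorithm.

```python
# Hypothetical key names; values are the quoted hyperparameters.
HPARAMS = {
    "K_phase1": 200, "T_phase1": 25,   # outer/inner iterations, phase 1
    "K_phase2": 300, "T_phase2": 25,   # outer/inner iterations, phase 2
    "alpha": 0.002, "beta": 0.001, "a": 0.002, "b": 0.002,  # stepsizes
    "L_xi_xi": 20,
    # m, H pairs for the four sampling subroutines (our reading: batch
    # size and horizon, respectively)
    "m1_lambda": 15, "H1_lambda": 100, "m1_theta": 15, "H1_theta": 100,
    "m2_lambda": 15, "H2_lambda": 100, "m2_xi": 15, "H2_xi": 100,
    "m3_lambda": 10, "H3_lambda": 100, "m3_xi": 10, "H3_xi": 100,
    "m4_lambda": 10, "H4_lambda": 100, "m4_theta": 10, "H4_theta": 100,
}
```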