Robust Reinforcement Learning with General Utility
Authors: Ziyi Chen, Yan Wen, Zhengmian Hu, Heng Huang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present simulation results of Algorithm 1 for convex utility. Simulation Setting. We choose $\mathcal{S} = \{1, 2, \ldots, S\}$ with $S = 10$ states and $\mathcal{A} = \{1, 2, \ldots, A\}$ with $A = 5$ actions. The discount factor is $\gamma = 0.95$ and we select the uniform distribution as the initial state distribution $\rho$. To optimize the objective function (2), we apply direct parameterization to the policy parameter $\theta_{s,a} = \pi(a \mid s)$ with $\theta \in \Theta = (\Delta_{\mathcal{A}})^{S}$ and the transition kernel parameter $\xi_{s,a,s'} = p(s' \mid s, a)$ with $\xi \in (\Delta_{\mathcal{S}})^{S \times A}$. In order to preserve $\xi(:,:,s') \in \Delta_{\mathcal{S}}$, we select the nominal kernel $\bar{\xi}(\cdot,\cdot,s')$ as $\lvert 10 + \epsilon_{s'} \rvert / \sum_{s'} \lvert 10 + \epsilon_{s'} \rvert$, where $\epsilon_{s'} \overset{\mathrm{i.i.d.}}{\sim} N(0, 1)$ for each $s' \in \mathcal{S}$. Then we select a sufficiently small radius $r = 0.01 < \min_{s,a,s'} \bar{\xi}_{s,a,s'}$ and use the $\ell_2$ ambiguity set $\Xi := \{\xi : \lVert \xi(s,:,:) - \bar{\xi}(s,:,:) \rVert \le r\}$ (for the transition kernel) such that all $\xi \in \Xi$ have all positive entries. As for the general utility function $f$, we use the following convex entropy function with application to exploration (Example 2.2 of [54]): $\min_{\theta \in \Theta} \max_{\xi \in \Xi} f(\lambda_{\theta,\xi}) := \sum_{s} \lambda_{\theta,\xi}(s) \log \lambda_{\theta,\xi}(s)$ (29), where $\lambda_{\theta,\xi}(s) := \sum_{a \in \mathcal{A}} \lambda_{\theta,\xi}(s, a)$ denotes the state visitation measure for any $s \in \mathcal{S}$, $\theta \in \Theta$ and $\xi \in \Xi$. Hyperparameters. For Algorithm 1, we use the following hyperparameters obtained from fine-tuning rather than from Theorem 2: $K = 200$, $T = 25$, $K' = 300$, $T' = 25$, $\alpha = 0.002$, $\beta = 0.001$, $a = 0.002$, $b = 0.002$, $L_{\xi,\xi} = 20$, $m_{\lambda}^{(1)} = 15$, $H_{\lambda}^{(1)} = 100$, $m_{\theta}^{(1)} = 15$, $H_{\theta}^{(1)} = 100$, $m_{\lambda}^{(2)} = 15$, $H_{\lambda}^{(2)} = 100$, $m_{\xi}^{(2)} = 15$, $H_{\xi}^{(2)} = 100$, $m_{\lambda}^{(3)} = 10$, $H_{\lambda}^{(3)} = 100$, $m_{\xi}^{(3)} = 10$, $H_{\xi}^{(3)} = 100$, $m_{\lambda}^{(4)} = 10$, $H_{\lambda}^{(4)} = 100$, $m_{\theta}^{(4)} = 10$, $H_{\theta}^{(4)} = 100$. Environment. The experiment is implemented in Python 3.8 on an AMD EPYC-7313 CPU at 3.00GHz and takes about 1.5 hours in total. Results. The numerical result of Algorithm 1 is shown in Figure 1. Here the y-axis is the norm of the true projected gradient $\sqrt{\lVert G_{b}^{(\theta)}(\theta_k, \xi_k) \rVert^2 + \lVert G_{a}^{(\xi)}(\theta_k, \xi_k) \rVert^2}$ at each outer iteration $k$ of both phases of Algorithm 1 (separated by the green vertical dashed line), and the x-axis is the sample complexity (i.e., the total number of generated samples up to iteration $k$). Figure 1 shows that the projected gradient decays and converges to a small value, which matches Theorem 2. *(Minimal code sketches reconstructing this setup appear below the table.)* |
| Researcher Affiliation | Academia | Ziyi Chen, Yan Wen, Zhengmian Hu, Heng Huang, Department of Computer Science, Institute of Health Computing, University of Maryland College Park, College Park, MD 20742, USA {zc286,ywen1,zhu123,heng}@umd.edu |
| Pseudocode | Yes | Algorithm 1 Projected Stochastic Gradient Descent Ascent Algorithm For Convex Utility |
| Open Source Code | Yes | We have uploaded our code which generates the simulation data for our experiments. |
| Open Datasets | No | The paper specifies a simulation setting with generated data, not a publicly available dataset with concrete access information (link, DOI, citation). |
| Dataset Splits | No | The paper describes simulation settings and algorithms but does not specify training/test/validation dataset splits (e.g., percentages, sample counts, or predefined splits). |
| Hardware Specification | Yes | Environment. The experiment is implemented in Python 3.8 on an AMD EPYC-7313 CPU at 3.00GHz and takes about 1.5 hours in total. |
| Software Dependencies | Yes | Environment. The experiment is implemented in Python 3.8 on an AMD EPYC-7313 CPU at 3.00GHz and takes about 1.5 hours in total. |
| Experiment Setup | Yes | Hyperparameters. For Algorithm 1, we use the following hyperparameters obtained from fine-tuning rather than from Theorem 2: $K = 200$, $T = 25$, $K' = 300$, $T' = 25$, $\alpha = 0.002$, $\beta = 0.001$, $a = 0.002$, $b = 0.002$, $L_{\xi,\xi} = 20$, $m_{\lambda}^{(1)} = 15$, $H_{\lambda}^{(1)} = 100$, $m_{\theta}^{(1)} = 15$, $H_{\theta}^{(1)} = 100$, $m_{\lambda}^{(2)} = 15$, $H_{\lambda}^{(2)} = 100$, $m_{\xi}^{(2)} = 15$, $H_{\xi}^{(2)} = 100$, $m_{\lambda}^{(3)} = 10$, $H_{\lambda}^{(3)} = 100$, $m_{\xi}^{(3)} = 10$, $H_{\xi}^{(3)} = 100$, $m_{\lambda}^{(4)} = 10$, $H_{\lambda}^{(4)} = 100$, $m_{\theta}^{(4)} = 10$, $H_{\theta}^{(4)} = 100$. |
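
The Simulation Setting quoted in the Research Type row pins down the nominal kernel and ambiguity-set construction precisely enough to sketch in code. The snippet below is a minimal reconstruction, not the authors' released code: the seed is arbitrary, `in_ambiguity_set` is a hypothetical helper, and we read the quote as drawing one noise term per destination state $s'$, shared across all $(s, a)$ pairs.

```python
import numpy as np

S, A = 10, 5                      # states and actions from the quoted setting
rng = np.random.default_rng(0)    # arbitrary seed; the paper does not specify one

# Nominal kernel: weight |10 + eps_{s'}| per destination state s', normalized
# over s'. Read literally, the quote shares these weights across every (s, a).
eps = rng.standard_normal(S)
w = np.abs(10.0 + eps)
xi_bar = np.tile(w / w.sum(), (S, A, 1))   # shape (S, A, S); each row sums to 1

# Radius of the L2 ambiguity set. The quote requires
# r = 0.01 < min_{s,a,s'} xi_bar[s,a,s'], keeping every kernel in the set positive.
r = 0.01
assert r < xi_bar.min()

def in_ambiguity_set(xi, xi_bar, r):
    """Hypothetical check of the quoted constraint ||xi(s,:,:) - xi_bar(s,:,:)|| <= r."""
    return all(np.linalg.norm(xi[s] - xi_bar[s]) <= r + 1e-12 for s in range(xi.shape[0]))
```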
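
The entropy objective in Eq. (29) needs the state visitation measure $\lambda_{\theta,\xi}(s)$. For a tabular MDP this has a closed form via a linear system; the sketch below assumes the normalized convention with a $(1-\gamma)$ factor (so that $\lambda$ is a probability distribution), which the quote does not state explicitly.

```python
import numpy as np

def state_visitation(pi, xi, rho, gamma=0.95):
    """State visitation measure lambda_{theta,xi}(s) for policy pi under kernel xi.

    pi:  (S, A) array, pi[s, a] = pi(a|s)   (direct parameterization theta)
    xi:  (S, A, S) array, xi[s, a, s'] = p(s'|s, a)
    rho: (S,) initial state distribution

    Solves d = (1 - gamma) * rho + gamma * P_pi^T d, where
    P_pi[s, s'] = sum_a pi(a|s) * xi[s, a, s'].
    """
    S = rho.shape[0]
    P_pi = np.einsum("sa,sap->sp", pi, xi)
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1.0 - gamma) * rho)

def entropy_utility(lam_s, eps=1e-12):
    """Convex entropy utility f(lambda) = sum_s lambda(s) log lambda(s) from Eq. (29)."""
    return float(np.sum(lam_s * np.log(lam_s + eps)))

# Example: uniform policy, uniform initial distribution, placeholder uniform kernel.
S, A = 10, 5
pi = np.full((S, A), 1.0 / A)
rho = np.full(S, 1.0 / S)
xi = np.full((S, A, S), 1.0 / S)   # xi_bar from the previous sketch could be used instead
lam = state_visitation(pi, xi, rho)
print(entropy_utility(lam))        # -log(S) ~ -2.303: uniform occupancy minimizes f
```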
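
Figure 1's y-axis metric $\sqrt{\lVert G_{b}^{(\theta)} \rVert^2 + \lVert G_{a}^{(\xi)} \rVert^2}$ is a projected-gradient norm evaluated with the quoted step sizes $b = 0.002$ and $a = 0.002$. The sketch below assumes the standard gradient-mapping definition $G_{\eta}(x) = (x - \mathrm{Proj}(x - \eta g))/\eta$ and, for brevity, projects the kernel rows only onto the simplex, ignoring the $\ell_2$-ball part of $\Xi$; both simplifications are ours, not the paper's.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex (Duchi et al., 2008)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    k = np.arange(1, v.size + 1)
    j = np.nonzero(u - (css - 1.0) / k > 0)[0][-1] + 1
    tau = (css[j - 1] - 1.0) / j
    return np.maximum(v - tau, 0.0)

def projected_grad_norm(theta, g_theta, xi, g_xi, b=0.002, a=0.002):
    """Sketch of the Figure 1 metric sqrt(||G_b^(theta)||^2 + ||G_a^(xi)||^2)."""
    # Descent step on theta, projected back onto (Delta_A)^S row by row.
    theta_new = np.stack([project_simplex(row) for row in theta - b * g_theta])
    G_theta = (theta - theta_new) / b
    # Ascent step on xi; only the simplex constraint is enforced here.
    xi_step = (xi + a * g_xi).reshape(-1, xi.shape[-1])
    xi_new = np.stack([project_simplex(row) for row in xi_step])
    G_xi = (xi.reshape(-1, xi.shape[-1]) - xi_new) / a
    return float(np.sqrt(np.linalg.norm(G_theta) ** 2 + np.linalg.norm(G_xi) ** 2))
```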