Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Robust Reinforcement Learning with General Utility
Authors: Ziyi Chen, Yan Wen, Zhengmian Hu, Heng Huang
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we present simulation results of Algorithm 1 for convex utility. Simulation Setting. We choose $\mathcal{S} = \{1, 2, \ldots, S\}$ with $S = 10$ states and $\mathcal{A} = \{1, 2, \ldots, A\}$ with $A = 5$ actions. The discount factor is $\gamma = 0.95$ and we select the uniform distribution as the initial state distribution $\rho$. To optimize the objective function (2), we apply direct parameterization to the policy parameter $\theta_{s,a} = \pi(a \mid s) \in \Theta = (\Delta^{\mathcal{A}})^{\mathcal{S}}$ and the transition kernel parameter $\xi_{s,a,s'} = p(s' \mid s, a) \in (\Delta^{\mathcal{S}})^{\mathcal{S} \times \mathcal{A}}$. In order to preserve $\xi(:, :, s') \in \Delta^{\mathcal{S}}$, we select the nominal kernel $\xi^0(\cdot, \cdot, s')$ as $\frac{\lvert 10 + \varepsilon_{s'} \rvert}{\sum_{s''} \lvert 10 + \varepsilon_{s''} \rvert}$, where $\varepsilon_{s'} \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)$ for each $s' \in \mathcal{S}$. Then we select a sufficiently small radius $r = 0.01 < \min_{s,a,s'} \xi^0_{s,a,s'}$ and use the $L_2$ ambiguity set $\Xi := \{\xi : \lVert \xi(s, :, :) - \xi^0(s, :, :) \rVert \le r\}$ (for the transition kernel) such that all $\xi \in \Xi$ have all positive entries. As for the general utility function $f$, we use the following convex entropy function with application to exploration (Example 2.2 of [54]): $\min_{\theta \in \Theta} \max_{\xi \in \Xi} f(\lambda_{\theta,\xi}) := \sum_{s} \lambda_{\theta,\xi}(s) \log \lambda_{\theta,\xi}(s)$ (29), where $\lambda_{\theta,\xi}(s) := \sum_{a \in \mathcal{A}} \lambda_{\theta,\xi}(s, a)$ denotes the state visitation measure for any $s \in \mathcal{S}$, $\theta \in \Theta$ and $\xi \in \Xi$. Hyperparameters. For Algorithm 1, we use the following hyperparameters obtained from fine-tuning but not from Theorem 2: $K = 200$, $T = 25$, $K' = 300$, $T' = 25$, $\alpha = 0.002$, $\beta = 0.001$, $a = 0.002$, $b = 0.002$, $L_{\xi,\xi} = 20$, $m_\lambda^{(1)} = 15$, $H_\lambda^{(1)} = 100$, $m_\theta^{(1)} = 15$, $H_\theta^{(1)} = 100$, $m_\lambda^{(2)} = 15$, $H_\lambda^{(2)} = 100$, $m_\xi^{(2)} = 15$, $H_\xi^{(2)} = 100$, $m_\lambda^{(3)} = 10$, $H_\lambda^{(3)} = 100$, $m_\xi^{(3)} = 10$, $H_\xi^{(3)} = 100$, $m_\lambda^{(4)} = 10$, $H_\lambda^{(4)} = 100$, $m_\theta^{(4)} = 10$, $H_\theta^{(4)} = 100$. Environment. The experiment is implemented on Python 3.8 on AMD EPYC-7313 CPU with 3.00GHz, which costs about 1.5 hours in total. Results. The numerical result of Algorithm 1 is shown in Figure 1. Here the y-axis is the norm of the true projected gradient $\sqrt{\lVert G^{(\theta)}_b(\theta_k, \xi_k) \rVert^2 + \lVert G^{(\xi)}_a(\theta_k, \xi_k) \rVert^2}$ at each outer iteration $k$ of both phases of Algorithm 1 (separated by the green vertical dashed line), and the x-axis is the sample complexity (i.e., the total number of generated samples up to iteration $k$). Figure 1 shows that the projected gradient decays and converges to a small value, which matches Theorem 2. |
| Researcher Affiliation | Academia | Ziyi Chen, Yan Wen, Zhengmian Hu, Heng Huang. Department of Computer Science, Institute of Health Computing, University of Maryland College Park, College Park, MD 20742, USA. |
| Pseudocode | Yes | Algorithm 1 Projected Stochastic Gradient Descent Ascent Algorithm For Convex Utility |
| Open Source Code | Yes | We have uploaded our code which generates the simulation data for our experiments. |
| Open Datasets | No | The paper specifies a simulation setting with generated data, not a publicly available dataset with concrete access information (link, DOI, citation). |
| Dataset Splits | No | The paper describes simulation settings and algorithms but does not specify training/test/validation dataset splits (e.g., percentages, sample counts, or predefined splits). |
| Hardware Specification | Yes | Environment. The experiment is implemented on Python 3.8 on AMD EPYC-7313 CPU with 3.00GHz, which costs about 1.5 hours in total. |
| Software Dependencies | Yes | Environment. The experiment is implemented on Python 3.8 on AMD EPYC-7313 CPU with 3.00GHz, which costs about 1.5 hours in total. |
| Experiment Setup | Yes | Hyperparameters. For Algorithm 1, we use the following hyperparameters obtained from fine-tuning but not from Theorem 2: $K = 200$, $T = 25$, $K' = 300$, $T' = 25$, $\alpha = 0.002$, $\beta = 0.001$, $a = 0.002$, $b = 0.002$, $L_{\xi,\xi} = 20$, $m_\lambda^{(1)} = 15$, $H_\lambda^{(1)} = 100$, $m_\theta^{(1)} = 15$, $H_\theta^{(1)} = 100$, $m_\lambda^{(2)} = 15$, $H_\lambda^{(2)} = 100$, $m_\xi^{(2)} = 15$, $H_\xi^{(2)} = 100$, $m_\lambda^{(3)} = 10$, $H_\lambda^{(3)} = 100$, $m_\xi^{(3)} = 10$, $H_\xi^{(3)} = 100$, $m_\lambda^{(4)} = 10$, $H_\lambda^{(4)} = 100$, $m_\theta^{(4)} = 10$, $H_\theta^{(4)} = 100$. |
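
The simulation setting quoted in the Research Type row can be sketched in a few lines of Python. This is an illustrative reading of the quoted text, not the authors' uploaded code: the function names (`nominal_kernel`, `state_visitation`, `utility`), the seed, the truncation horizon, and the choice of a normalized discounted visitation measure are all assumptions made here. The sketch builds the nominal kernel from one Gaussian draw per next state, checks that the radius 0.01 lies below the smallest kernel entry, and evaluates the entropy-style utility for a uniform policy.

```python
import math
import random

# Values quoted in the table: 10 states, 5 actions, discount 0.95, radius 0.01.
S, A, GAMMA, RADIUS = 10, 5, 0.95, 0.01


def nominal_kernel(rng):
    """Nominal kernel with entries |10 + eps_s'| / sum over next states.

    Assumption: one N(0, 1) draw per next state s', shared by all (s, a).
    """
    weights = [abs(10.0 + rng.gauss(0.0, 1.0)) for _ in range(S)]
    total = sum(weights)
    row = [w / total for w in weights]
    return [[list(row) for _ in range(A)] for _ in range(S)]


def state_visitation(policy, kernel, rho, horizon=600):
    """Discounted state visitation, truncated after `horizon` steps.

    Assumption: the normalized form (1 - gamma) * sum_t gamma^t Pr(s_t = s).
    """
    lam = [0.0] * S
    dist = list(rho)
    weight = 1.0 - GAMMA
    for _ in range(horizon):
        for s in range(S):
            lam[s] += weight * dist[s]
        nxt = [0.0] * S
        for s in range(S):
            for a in range(A):
                mass = dist[s] * policy[s][a]
                for s2 in range(S):
                    nxt[s2] += mass * kernel[s][a][s2]
        dist = nxt
        weight *= GAMMA
    return lam


def utility(lam):
    """Convex utility f(lambda) = sum_s lambda(s) * log(lambda(s))."""
    return sum(x * math.log(x) for x in lam if x > 0.0)


rng = random.Random(0)  # hypothetical seed, chosen only for repeatability
xi0 = nominal_kernel(rng)
rho = [1.0 / S] * S  # uniform initial state distribution
uniform_policy = [[1.0 / A] * A for _ in range(S)]

min_entry = min(p for per_state in xi0 for row in per_state for p in row)
lam = state_visitation(uniform_policy, xi0, rho)
f_value = utility(lam)

print("smallest nominal kernel entry:", min_entry)
print("utility under uniform policy:", f_value)
```

Because the smallest nominal entry exceeds the radius, any kernel within distance 0.01 of the nominal one keeps strictly positive entries, which is the property the quoted setting relies on.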