Learning Infinite-horizon Average-reward Markov Decision Process with Constraints
Authors: Liyu Chen, Rahul Jain, Haipeng Luo
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Algorithm 1 empirically on a variant of the single hop wireless network environment similar to (Singh et al., 2020), where a wireless node continuously transmits data packets to a receiver. [...] Figure 1. Experiment results of running Algorithm 1 on a variant of the single hop wireless network environment similar to (Singh et al., 2020). The plots from left to right are accumulated regret, accumulated constraint violation, and value of dual variables {λ_k}_k in 3 × 10^6 time steps, respectively. Each plot is an average of 5 repeated runs, and the shaded area is a 95% confidence interval. |
| Researcher Affiliation | Academia | Liyu Chen 1 Rahul Jain 1 Haipeng Luo 1 1University of Southern California. Correspondence to: Liyu Chen <liyuc@usc.edu>. |
| Pseudocode | Yes | The complete pseudocode is presented in Algorithm 1. [...] The pseudocode of ESTIMATEQ is shown in Algorithm 2. |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper describes a simulated environment setup ('variant of the single hop wireless network environment similar to (Singh et al., 2020)') but does not provide concrete access information (link, DOI, repository, or formal citation with author/year) for a publicly available dataset. |
| Dataset Splits | No | The paper evaluates an RL algorithm in a simulated environment over 'T = 3 * 10^6 time steps' and '5 different random seeds'. It does not describe explicit train/validation/test dataset splits typically found in supervised learning, as the evaluation is done through interaction with the environment. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or cloud resources) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies (e.g., libraries, frameworks, or solvers) used in the experiments. |
| Experiment Setup | Yes | We run Algorithm 1 for T = 3 × 10^6 time steps and 5 different random seeds with the following manually best-tuned parameters: H = 300, N = 20, η = √T, λ = η, ϵ = 0.01 and θ = 10/√T. We also scale ι and the range of transition confidence sets by a factor of 0.1 to accelerate learning. |
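Since the paper releases no code, a minimal sketch of the reported configuration may help a reproducer get started. The snippet below encodes the hyperparameters from the Experiment Setup row and a generic projected dual-ascent step of the kind used by primal-dual constrained-MDP algorithms. Everything here is an assumption, not the authors' implementation: `dual_update` is a hypothetical helper, and reading the garbled values "η = T" and "θ = 10/ T" as √T quantities is an editorial inference.

```python
import numpy as np

# Hyperparameters as reported in the paper's experiment setup.
# The sqrt(T) readings for eta and theta are an assumption (PDF extraction
# appears to have dropped the radical sign).
T = 3 * 10**6            # total number of time steps
H = 300                  # horizon parameter of Algorithm 1
N = 20                   # epoch-count parameter of Algorithm 1
eta = np.sqrt(T)         # primal learning-rate scale (assumed eta = sqrt(T))
lam_cap = eta            # cap on the dual variable (paper: lambda = eta)
eps = 0.01               # slackness parameter epsilon
theta = 10 / np.sqrt(T)  # dual step size (assumed theta = 10/sqrt(T))

def dual_update(lam: float, violation: float,
                step: float = theta, cap: float = lam_cap) -> float:
    """Hypothetical projected dual-ascent step: raise lambda in proportion
    to the observed constraint violation, then clip to [0, cap]."""
    return float(np.clip(lam + step * violation, 0.0, cap))
```

A reproducer would plug a step like this into their own reimplementation of Algorithm 1 and track the accumulated regret, accumulated constraint violation, and the dual variables over the 3 × 10^6 steps, averaging over 5 seeds as the paper does.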