Learning Infinite-horizon Average-reward Markov Decision Process with Constraints

Authors: Liyu Chen, Rahul Jain, Haipeng Luo

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate Algorithm 1 empirically on a variant of the single hop wireless network environment similar to (Singh et al., 2020), where a wireless node continuously transmits data packets to a receiver. [...] Figure 1. Experiment results of running Algorithm 1 on a variant of the single hop wireless network environment similar to (Singh et al., 2020). The plots from left to right are accumulated regret, accumulated constraint violation, and value of dual variables {λ_k}_k in 3 × 10^6 time steps respectively. Each plot is an average of 5 repeated runs, and the shaded area is the 95% confidence interval.
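The "average of 5 repeated runs" with a shaded 95% confidence interval described above is a standard construction. A minimal NumPy sketch of that computation (generic, not the authors' plotting code; the synthetic curves below stand in for the regret/violation traces):

```python
import numpy as np

def mean_ci95(runs: np.ndarray):
    """Mean and 95% t-confidence interval across repeated runs.

    runs: array of shape (n_runs, n_steps), one row per random seed.
    Returns (mean, lower, upper), each of shape (n_steps,).
    """
    n = runs.shape[0]
    mean = runs.mean(axis=0)
    sem = runs.std(axis=0, ddof=1) / np.sqrt(n)  # standard error of the mean
    t975 = 2.776  # t critical value for a 95% CI with n - 1 = 4 dof (n = 5 runs)
    half = t975 * sem
    return mean, mean - half, mean + half

# Example: 5 synthetic cumulative curves standing in for per-seed regret traces.
rng = np.random.default_rng(0)
curves = np.cumsum(rng.normal(1.0, 0.3, size=(5, 1000)), axis=1)
m, lo, hi = mean_ci95(curves)
```

The `(m, lo, hi)` arrays are what a plotting call such as `fill_between` would consume to draw the shaded band around the mean curve.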
Researcher Affiliation | Academia | Liyu Chen, Rahul Jain, Haipeng Luo (University of Southern California). Correspondence to: Liyu Chen <liyuc@usc.edu>.
Pseudocode | Yes | The complete pseudocode is presented in Algorithm 1. [...] The pseudocode of ESTIMATEQ is shown in Algorithm 2.
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | No | The paper describes a simulated environment setup ('variant of the single hop wireless network environment similar to (Singh et al., 2020)') but does not provide concrete access information (link, DOI, repository, or formal citation with author/year) for a publicly available dataset.
Dataset Splits | No | The paper evaluates an RL algorithm in a simulated environment over 'T = 3 * 10^6 time steps' and '5 different random seeds'. It does not describe explicit train/validation/test dataset splits typically found in supervised learning, as the evaluation is done through interaction with the environment.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or cloud resources) used for running the experiments.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies (e.g., libraries, frameworks, or solvers) used in the experiments.
Experiment Setup | Yes | We run Algorithm 1 for T = 3 × 10^6 time steps and 5 different random seeds with the following manually best-tuned parameters: H = 300, N = 20, η = √T, λ = η, ϵ = 0.01 and θ = 10/√T. We also scale ι and the range of transition confidence sets by a factor of 0.1 to accelerate learning.
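For concreteness, the reported tuning can be collected into a single configuration. This is a hypothetical transcription, not the authors' code; in particular, the square-root forms of η and θ are an assumption, since the extracted text ('η = T', 'θ = 10/ T') appears to have dropped √ radicals:

```python
import math

T = 3 * 10**6  # total time steps reported in the paper
config = {
    "H": 300,                     # horizon parameter of Algorithm 1
    "N": 20,
    "eta": math.sqrt(T),          # assumed: garbled 'η = T' read as η = √T
    "lam": math.sqrt(T),          # λ = η as reported
    "eps": 0.01,                  # ϵ
    "theta": 10 / math.sqrt(T),   # assumed: garbled 'θ = 10/ T' read as 10/√T
    "confidence_scale": 0.1,      # scaling on ι and transition confidence sets
    "n_seeds": 5,                 # number of repeated runs
}
```

A reproduction attempt should verify these values against the published PDF before use.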