Achieving $\tilde{O}(1/\epsilon)$ Sample Complexity for Constrained Markov Decision Process

Authors: Jiashuo Jiang, Yinyu Ye

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We implement our Algorithm 2 to study the numerical performance. We consider a CMDP problem with state space $|\mathcal{S}| = 10$ and action space $|\mathcal{A}| = 10$. We set the discount factor $\gamma = 0.7$. We then randomly generate the probability transition kernel $P$. To be specific, for each state $s \in \mathcal{S}$, action $a \in \mathcal{A}$, and future state $s' \in \mathcal{S}$, we uniformly generate a random variable $p_{s,a,s'}$. The transition probability is then defined as $P(s' \mid s, a) = p_{s,a,s'} / \sum_{s'' \in \mathcal{S}} p_{s,a,s''}$. For each state-action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$, the expected reward $\hat{r}(s,a)$ is uniformly generated from the interval $[1, 2]$ (with the reward for the first action set to $0$). The actual reward is $r(s,a) = \hat{r}(s,a) + \eta$, where $\eta$ is uniformly distributed on $[-0.5, 0.5]$. There are $K = 5$ constraints, and for each constraint $k \in [K]$ and each state-action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$, the expected cost $\hat{c}_k(s,a)$ is uniformly generated from $[1, 2]$. The actual cost is $c_k(s,a) = \hat{c}_k(s,a) + \eta'$, where $\eta'$ is uniformly distributed on $[-0.5, 0.5]$. For each total number of iterations $N$, we apply Algorithm 2 and obtain the output $q^1, \dots, q^N$. We compare $q^N$ with the optimal occupancy measure $q^*$. Since our algorithm is randomized, we study its performance in expectation. To be specific, given the problem instance and a fixed $N$, we run our algorithm repeatedly for $M = 500$ rounds. Denote by $q^N_m$ the output of Algorithm 2 at round $m$, for $m \in [M]$. We define the error term as $\mathrm{Err}(N) = \frac{1}{M} \sum_{m=1}^{M} \|q^N_m - q^*\|_1 / \|q^*\|_1$. We study how the error term $\mathrm{Err}(N)$ scales with $N$. The results are displayed in Figure 2. (A code sketch of this repeated-run evaluation appears after the table.)
Researcher Affiliation | Academia | Jiashuo Jiang, Department of Industrial Engineering & Decision Analytics, Hong Kong University of Science and Technology, Hong Kong, China, jsjiang@ust.hk; Yinyu Ye, Department of Management Science & Engineering, Institute of Computational Mathematics and Engineering, Stanford University, California, US, yyye@stanford.edu
Pseudocode | Yes | Algorithm 1: Algorithm for identifying one optimal basis. Algorithm 2: The Adaptive-resolving Algorithm.
Open Source Code | No | The paper does not contain any statements about releasing code or provide any links to a code repository.
Open Datasets | No | We consider a CMDP problem with state space $|\mathcal{S}| = 10$ and action space $|\mathcal{A}| = 10$. We set the discount factor $\gamma = 0.7$. We then randomly generate the probability transition kernel $P$. ... The expected reward $\hat{r}(s,a)$ is uniformly generated from the interval $[1, 2]$... There are $K = 5$ constraints...
Dataset Splits | No | The paper describes numerical experiments with randomly generated data for simulation, but does not mention explicit training, validation, or test dataset splits typically used with fixed datasets.
Hardware Specification | No | The paper describes the experimental setup in Appendix A but does not specify any hardware details like CPU or GPU models, or memory.
Software Dependencies | No | The paper describes the experimental setup in Appendix A but does not specify any software dependencies or version numbers.
Experiment Setup | Yes | We consider a CMDP problem with state space $|\mathcal{S}| = 10$ and action space $|\mathcal{A}| = 10$. We set the discount factor $\gamma = 0.7$. We then randomly generate the probability transition kernel $P$. To be specific, for each state $s \in \mathcal{S}$, action $a \in \mathcal{A}$, and future state $s' \in \mathcal{S}$, we uniformly generate a random variable $p_{s,a,s'}$. The transition probability is then defined as $P(s' \mid s, a) = p_{s,a,s'} / \sum_{s'' \in \mathcal{S}} p_{s,a,s''}$. For each state-action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$, the expected reward $\hat{r}(s,a)$ is uniformly generated from the interval $[1, 2]$ (with the reward for the first action set to $0$). The actual reward is $r(s,a) = \hat{r}(s,a) + \eta$, where $\eta$ is uniformly distributed on $[-0.5, 0.5]$. There are $K = 5$ constraints, and for each constraint $k \in [K]$ and each state-action pair $(s,a) \in \mathcal{S} \times \mathcal{A}$, the expected cost $\hat{c}_k(s,a)$ is uniformly generated from $[1, 2]$. The actual cost is $c_k(s,a) = \hat{c}_k(s,a) + \eta'$, where $\eta'$ is uniformly distributed on $[-0.5, 0.5]$. For each total number of iterations $N$, we apply Algorithm 2 and obtain the output $q^1, \dots, q^N$. We compare $q^N$ with the optimal occupancy measure. Since our algorithm is randomized, we study its performance in expectation. To be specific, given the problem instance and a fixed $N$, we run our algorithm repeatedly for $M = 500$ rounds. (A code sketch of this instance generation appears after the table.)
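
The random-instance generation quoted in the Experiment Setup row can be made concrete. Below is a minimal Python/NumPy sketch of that setup; since the paper releases no code, the function names (`sample_cmdp_instance`, `noisy_reward`, `noisy_cost`) and the seeding scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sample_cmdp_instance(num_states=10, num_actions=10, num_constraints=5, seed=0):
    """Sample a random CMDP instance following the quoted setup (hypothetical helper)."""
    rng = np.random.default_rng(seed)

    # Transition kernel: draw uniform weights p_{s,a,s'} and normalize over s'.
    weights = rng.uniform(size=(num_states, num_actions, num_states))
    P = weights / weights.sum(axis=2, keepdims=True)

    # Expected rewards r_hat(s, a) ~ Uniform[1, 2], with the first action's reward set to 0.
    r_hat = rng.uniform(1.0, 2.0, size=(num_states, num_actions))
    r_hat[:, 0] = 0.0

    # Expected costs c_hat_k(s, a) ~ Uniform[1, 2] for each of the K = 5 constraints.
    c_hat = rng.uniform(1.0, 2.0, size=(num_constraints, num_states, num_actions))

    return P, r_hat, c_hat

def noisy_reward(r_hat, s, a, rng):
    """Observed reward r(s, a) = r_hat(s, a) + eta, with eta ~ Uniform[-0.5, 0.5]."""
    return r_hat[s, a] + rng.uniform(-0.5, 0.5)

def noisy_cost(c_hat, k, s, a, rng):
    """Observed cost c_k(s, a) = c_hat_k(s, a) + eta', with eta' ~ Uniform[-0.5, 0.5]."""
    return c_hat[k, s, a] + rng.uniform(-0.5, 0.5)
```

A single call to `sample_cmdp_instance()` fixes the problem instance; the noisy-reward and noisy-cost helpers model the per-sample observations the algorithm would receive. The discount factor $\gamma = 0.7$ enters only through the algorithm itself and is therefore not part of this instance sampler.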
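The reported error term $\mathrm{Err}(N) = \frac{1}{M} \sum_{m=1}^{M} \|q^N_m - q^*\|_1 / \|q^*\|_1$, averaged over $M = 500$ repeated runs, can likewise be sketched. Here `run_algorithm_2` is a hypothetical stand-in for the paper's Algorithm 2 (not reproduced here), assumed to return the final occupancy-measure iterate $q^N$ for a given iteration budget.

```python
import numpy as np

def estimate_error(run_algorithm_2, q_star, N, M=500, seed=0):
    """Estimate Err(N) = (1/M) * sum_m ||q^N_m - q*||_1 / ||q*||_1.

    `run_algorithm_2` is a placeholder callable for Algorithm 2; it is assumed
    to accept an iteration budget and a seed and to return the final iterate q^N.
    """
    errors = []
    for m in range(M):
        q_N_m = run_algorithm_2(num_iterations=N, seed=seed + m)
        # Relative L1 error of this run's output against the optimal occupancy measure.
        rel_err = np.abs(q_N_m - q_star).sum() / np.abs(q_star).sum()
        errors.append(rel_err)
    return float(np.mean(errors))
```

Repeating this for a grid of values of $N$ reproduces the kind of scaling curve shown in the paper's Figure 2.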