Achieving $\tilde{O}(1/\epsilon)$ Sample Complexity for Constrained Markov Decision Process
Authors: Jiashuo Jiang, Yinyu Ye
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We implement our Algorithm 2 to study the numerical performance. We consider a CMDP problem with state space \|S\| = 10 and action space \|A\| = 10. We set the discount factor γ = 0.7. We then randomly generate the probability transition kernel P. To be specific, for each state s ∈ S, action a ∈ A, and future state s′ ∈ S, we uniformly generate a random variable p_{s,a,s′}. Then, the transition probability is defined as P(s′\|s, a) = p_{s,a,s′} / Σ_{s″∈S} p_{s,a,s″}. For each state-action pair (s, a) ∈ S × A, the expected reward r̂(s, a) is uniformly generated from the interval [1, 2] (with the reward for the first action set to 0). The actual reward is r(s, a) = r̂(s, a) + η, where η is uniformly distributed on [−0.5, 0.5]. There are K = 5 constraints, and for each constraint k ∈ [K] and each state-action pair (s, a) ∈ S × A, the expected cost ĉ_k(s, a) is uniformly generated from [1, 2]. The actual cost is c_k(s, a) = ĉ_k(s, a) + η′, where η′ is uniformly distributed on [−0.5, 0.5]. For each total number of iterations N, we apply Algorithm 2 and obtain the output q^1, ..., q^N. We compare q^N with the optimal occupancy measure q*. Since our algorithm is randomized, we study its performance in expectation. To be specific, given the problem instance and a fixed N, we run our algorithm repeatedly for M = 500 rounds. Denote by q^N_m the output of our Algorithm 2 at round m, for m ∈ [M]. We define the error term as Err(N) = (1/M) Σ_{m=1}^{M} \|\|q^N_m − q*\|\|_1 / \|\|q*\|\|_1. We study how Err(N) scales with N. The results are displayed in Figure 2. |
| Researcher Affiliation | Academia | Jiashuo Jiang, Department of Industrial Engineering & Decision Analytics, Hong Kong University of Science and Technology, Hong Kong, China, jsjiang@ust.hk; Yinyu Ye, Department of Management Science & Engineering, Institute of Computational Mathematics and Engineering, Stanford University, California, US, yyye@stanford.edu |
| Pseudocode | Yes | Algorithm 1 Algorithm for identifying one optimal basis. Algorithm 2 The Adaptive-resolving Algorithm. |
| Open Source Code | No | The paper does not contain any statements about releasing code or provide any links to a code repository. |
| Open Datasets | No | We consider a CMDP problem with state space \|S\| = 10 and action space \|A\| = 10. We set the discount factor γ = 0.7. We then randomly generate the probability transition kernel P. ... The expected reward r̂(s, a) is uniformly generated from the interval [1, 2]... There are K = 5 constraints... |
| Dataset Splits | No | The paper describes numerical experiments with randomly generated data for simulation, but does not mention explicit training, validation, or test dataset splits typically used with fixed datasets. |
| Hardware Specification | No | The paper describes the experimental setup in Appendix A but does not specify any hardware details like CPU, GPU models, or memory. |
| Software Dependencies | No | The paper describes the experimental setup in Appendix A but does not specify any software dependencies or version numbers. |
| Experiment Setup | Yes | We consider a CMDP problem with state space \|S\| = 10 and action space \|A\| = 10. We set the discount factor γ = 0.7. We then randomly generate the probability transition kernel P. To be specific, for each state s ∈ S, action a ∈ A, and future state s′ ∈ S, we uniformly generate a random variable p_{s,a,s′}. Then, the transition probability is defined as P(s′\|s, a) = p_{s,a,s′} / Σ_{s″∈S} p_{s,a,s″}. For each state-action pair (s, a) ∈ S × A, the expected reward r̂(s, a) is uniformly generated from the interval [1, 2] (with the reward for the first action set to 0). The actual reward is r(s, a) = r̂(s, a) + η, where η is uniformly distributed on [−0.5, 0.5]. There are K = 5 constraints, and for each constraint k ∈ [K] and each state-action pair (s, a) ∈ S × A, the expected cost ĉ_k(s, a) is uniformly generated from [1, 2]. The actual cost is c_k(s, a) = ĉ_k(s, a) + η′, where η′ is uniformly distributed on [−0.5, 0.5]. For each total number of iterations N, we apply Algorithm 2 and obtain the output q^1, ..., q^N. We compare q^N with the optimal occupancy measure q*. Since our algorithm is randomized, we study its performance in expectation. To be specific, given the problem instance and a fixed N, we run our algorithm repeatedly for M = 500 rounds. |
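
Since the paper releases no code, the instance-generation and evaluation procedure quoted above can be sketched as follows. This is a minimal NumPy sketch under the stated assumptions (\|S\| = 10, \|A\| = 10, K = 5, γ = 0.7, rewards/costs in [1, 2], uniform ±0.5 noise); all function names here are illustrative, not the authors' implementation, and the Err(N) helper only computes the relative ℓ1 error given algorithm outputs, since Algorithm 2 itself is not reproduced here.

```python
import numpy as np

def generate_cmdp_instance(num_states=10, num_actions=10, num_constraints=5,
                           gamma=0.7, seed=0):
    """Randomly generate a CMDP instance as described in the experiment setup."""
    rng = np.random.default_rng(seed)

    # Transition kernel: draw uniform p_{s,a,s'} and normalize over s'.
    p = rng.uniform(size=(num_states, num_actions, num_states))
    P = p / p.sum(axis=2, keepdims=True)

    # Expected rewards uniform on [1, 2], first action's reward set to 0.
    r_hat = rng.uniform(1.0, 2.0, size=(num_states, num_actions))
    r_hat[:, 0] = 0.0

    # Expected costs uniform on [1, 2] for each of the K constraints.
    c_hat = rng.uniform(1.0, 2.0, size=(num_constraints, num_states, num_actions))

    return P, r_hat, c_hat, gamma

def noisy_observation(expected, rng):
    """Actual reward/cost: expected value plus uniform noise on [-0.5, 0.5]."""
    return expected + rng.uniform(-0.5, 0.5, size=expected.shape)

def relative_l1_error(q_outputs, q_star):
    """Err(N) = (1/M) * sum_m ||q_m^N - q*||_1 / ||q*||_1 over M repeated runs."""
    q_star = np.asarray(q_star)
    denom = np.abs(q_star).sum()
    return np.mean([np.abs(np.asarray(q) - q_star).sum() / denom
                    for q in q_outputs])
```

In a full reproduction, `q_outputs` would hold the M = 500 outputs of Algorithm 2 for a given N, and `relative_l1_error` would be evaluated across a grid of N values to recreate Figure 2.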