Adaptive Sampling for Best Policy Identification in Markov Decision Processes
Authors: Aymen Al Marjani, Alexandre Proutiere
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we run numerical experiments to compare the performances of KLB-TS and BESPOKE (these are so far the two algorithms with problem-specific sample complexity guarantees). ... Figure 2 shows the mean sample complexity along with its 2-standard-deviations interval (which seems very small due to the use of a log-scale). |
| Researcher Affiliation | Academia | UMPA, ENS Lyon (this work was done while Aymen Al Marjani was at KTH); KTH Royal Institute of Technology. |
| Pseudocode | Yes | Algorithm 1 KLB-TS. Input: black-box planner MDP-SOLVER(), confidence parameter δ. Collect one sample from each (s,a) ∈ S × A; set t ← SA and n_t(s,a) ← 1 for all (s,a). Initialize the empirical estimate φ̂_t of φ; π̂⋆_t ← MDP-SOLVER(φ̂_t). While stopping condition (14) is not satisfied: compute the allocation vector ω(φ̂_t) of equation (9); sample from the pair (s_{t+1}, a_{t+1}) determined by equation (11); for all (s,a) set n_{t+1}(s,a) = n_t(s,a) + 1 if (s,a) = (s_{t+1}, a_{t+1}), and n_t(s,a) otherwise; t ← t + 1; update the empirical estimate φ̂_t of φ; π̂⋆_t ← MDP-SOLVER(φ̂_t). End while. Output: empirical optimal policy π̂⋆_τ. (A Python sketch of this loop appears after the table.) |
| Open Source Code | No | The paper does not provide any statement or link indicating the availability of open-source code for the described methodology. |
| Open Datasets | No | To compare the two algorithms, we generated two MDPs randomly: a first small MDP with two states and two actions, and a second larger and more realistic MDP with five states and ten actions per state. |
| Dataset Splits | No | The paper does not explicitly describe any training, validation, or test dataset splits. It only mentions generating two MDPs for experiments. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions a black-box planner MDP-SOLVER and the Policy Iteration algorithm but does not specify any software names with version numbers or other ancillary software dependencies required for replication. |
| Experiment Setup | Yes | We used BESPOKE with an accuracy parameter ε = 0.9 Δ_min (note that Δ_min is revealed to BESPOKE). For each value of the confidence level δ, we run 10 simulations for the first MDP under both algorithms. To save computation time in the case of the second MDP, we run 5 simulations for each δ and only compare KLB-TS's sample complexity with BESPOKE's initial number of samples n_min which, as noted in Appendix H, contributed more than 99% of its sample complexity. (A sketch of this comparison protocol appears after the table.) |
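For readers who want to trace Algorithm 1 outside the flattened table cell above, the following is a minimal Python sketch of the KLB-TS loop. It is not the authors' code (none is released): `mdp_sample`, `mdp_solver`, `allocation`, and `stopping_rule` are hypothetical placeholders for the generative model, the black-box planner, the allocation of equation (9), and the stopping condition (14), and the tracking step is only a stand-in for the sampling rule of equation (11).

```python
import numpy as np

def klb_ts(mdp_sample, mdp_solver, allocation, stopping_rule, S, A, max_steps=100_000):
    """Sketch of Algorithm 1 (KLB-TS), under the assumptions stated above.

    mdp_sample(s, a)  -> (next_state, reward): generative-model oracle.
    mdp_solver(phi)   -> optimal policy for the estimated model phi (black-box planner).
    allocation(phi)   -> (S, A) allocation vector omega (the paper's eq. (9)).
    stopping_rule(phi, counts, t) -> bool (the paper's stopping condition (14)).
    All four callables are placeholders; the paper defines them, this sketch does not.
    """
    counts = np.zeros((S, A))
    trans_counts = np.zeros((S, A, S))
    reward_sums = np.zeros((S, A))

    # Initialization: collect one sample from every (s, a) pair, set t = SA.
    for s in range(S):
        for a in range(A):
            s_next, r = mdp_sample(s, a)
            counts[s, a] += 1
            trans_counts[s, a, s_next] += 1
            reward_sums[s, a] += r
    t = S * A

    def estimate():
        # Empirical model phi_hat_t: transition kernel and mean rewards.
        p_hat = trans_counts / counts[..., None]
        r_hat = reward_sums / counts
        return p_hat, r_hat

    phi_hat = estimate()
    policy = mdp_solver(phi_hat)

    while not stopping_rule(phi_hat, counts, t) and t < max_steps:
        omega = allocation(phi_hat)                       # eq. (9)
        # Tracking step (stand-in for eq. (11)): sample the pair whose
        # count lags furthest behind its target allocation t * omega.
        s, a = np.unravel_index(np.argmax(t * omega - counts), (S, A))
        s_next, r = mdp_sample(s, a)
        counts[s, a] += 1
        trans_counts[s, a, s_next] += 1
        reward_sums[s, a] += r
        t += 1
        phi_hat = estimate()
        policy = mdp_solver(phi_hat)

    return policy, t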
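The experiment setup quoted in the last row can likewise be read as a small driver loop. The sketch below is an assumption-laden outline, not the paper's script: `random_mdp`, `run_klb_ts`, and `run_bespoke` are hypothetical helpers standing in for the random MDP generation and the two algorithms, and the collected statistics mirror the mean and standard deviation the paper reports in Figure 2.

```python
import numpy as np

def random_mdp(num_states, num_actions, rng):
    """Hypothetical helper: a random MDP (transition kernel + rewards),
    mirroring the paper's 'we generated two MDPs randomly'."""
    p = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
    r = rng.uniform(size=(num_states, num_actions))
    return p, r

def compare(run_klb_ts, run_bespoke, deltas, n_runs, num_states, num_actions, seed=0):
    """For each confidence level delta, run both algorithms n_runs times on the
    same randomly generated MDP and record their sample complexities.
    run_klb_ts / run_bespoke are placeholders returning a sample-complexity count."""
    rng = np.random.default_rng(seed)
    mdp = random_mdp(num_states, num_actions, rng)
    results = {}
    for delta in deltas:
        klb = [run_klb_ts(mdp, delta, rng) for _ in range(n_runs)]
        bes = [run_bespoke(mdp, delta, rng) for _ in range(n_runs)]
        results[delta] = {
            "klb_ts_mean": np.mean(klb), "klb_ts_std": np.std(klb),
            "bespoke_mean": np.mean(bes), "bespoke_std": np.std(bes),
        }
    return results
```

Under the paper's setting one would call this with the first MDP (2 states, 2 actions, n_runs = 10) and the second MDP (5 states, 10 actions, n_runs = 5), with `run_bespoke` reduced to returning its initial sample count n_min in the second case.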