Adaptive Sampling for Best Policy Identification in Markov Decision Processes
Authors: Aymen Al Marjani, Alexandre Proutiere
ICML 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we run numerical experiments to compare the performances of KLB-TS and BESPOKE (these are so far the two algorithms with problem-specific sample complexity guarantees). ... Figure 2 shows the mean sample complexity along with its 2-standard-deviations interval (which seems very small due to the use of a log-scale). |
| Researcher Affiliation | Academia | UMPA, ENS Lyon (this work was done while Aymen Al Marjani was at KTH); KTH Royal Institute of Technology. |
| Pseudocode | Yes | Algorithm 1 KLB-TS. Input: black-box planner MDP-SOLVER(), confidence parameter δ. Collect one sample from each (s,a) ∈ S × A; set t ← SA and n_t(s,a) ← 1 for all (s,a). Initialize the empirical estimate φ̂_t of φ; π̂⋆_t ← MDP-SOLVER(φ̂_t). While stopping condition (14) is not satisfied: compute the allocation vector ω(φ̂_t) of equation (9); sample from the pair (s_{t+1}, a_{t+1}) determined by equation (11); for all (s,a) set n_{t+1}(s,a) = n_t(s,a) + 1 if (s,a) = (s_{t+1}, a_{t+1}), and n_t(s,a) otherwise; t ← t + 1; update the empirical estimate φ̂_t of φ; π̂⋆_t ← MDP-SOLVER(φ̂_t). End while. Output: empirical optimal policy π̂⋆_τ. (A Python sketch of this loop appears after the table.) |
| Open Source Code | No | The paper does not provide any statement or link indicating the availability of open-source code for the described methodology. |
| Open Datasets | No | To compare the two algorithms, we generated two MDPs randomly: a first small MDP with two states and two actions, and a second larger and more realistic MDP with five states and ten actions per state. |
| Dataset Splits | No | The paper does not explicitly describe any training, validation, or test dataset splits. It only mentions generating two MDPs for experiments. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions a black-box planner MDP-SOLVER and the Policy Iteration algorithm but does not specify any software names with version numbers or other ancillary software dependencies required for replication. |
| Experiment Setup | Yes | We used BESPOKE with an accuracy parameter ε = 0.9 Δ_min (note that Δ_min is revealed to BESPOKE). For each value of the confidence level δ, we run 10 simulations for the first MDP under both algorithms. To save computation time in the case of the second MDP, we run 5 simulations for each δ and only compare KLB-TS's sample complexity with BESPOKE's initial number of samples n_min which, as noted in Appendix H, contributed more than 99% of its sample complexity. (A sketch of this comparison protocol appears after the table.) |
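For readers who want to trace Algorithm 1 outside the flattened table cell above, the following is a minimal Python sketch of the KLB-TS loop. It is not the authors' code (none is released): `mdp_sample`, `mdp_solver`, `allocation`, and `stopping_rule` are hypothetical placeholders for the generative model, the black-box planner, the allocation of equation (9), and the stopping condition (14), and the tracking step is only a stand-in for the sampling rule of equation (11).

```python
import numpy as np

def klb_ts(mdp_sample, mdp_solver, allocation, stopping_rule, S, A, max_steps=100_000):
    """Sketch of Algorithm 1 (KLB-TS), under the assumptions stated above.

    mdp_sample(s, a)  -> (next_state, reward): generative-model oracle.
    mdp_solver(phi)   -> optimal policy for the estimated model phi (black-box planner).
    allocation(phi)   -> (S, A) allocation vector omega (the paper's eq. (9)).
    stopping_rule(phi, counts, t) -> bool (the paper's stopping condition (14)).
    All four callables are placeholders; the paper defines them, this sketch does not.
    """
    counts = np.zeros((S, A))
    trans_counts = np.zeros((S, A, S))
    reward_sums = np.zeros((S, A))

    # Initialization: collect one sample from every (s, a) pair, set t = SA.
    for s in range(S):
        for a in range(A):
            s_next, r = mdp_sample(s, a)
            counts[s, a] += 1
            trans_counts[s, a, s_next] += 1
            reward_sums[s, a] += r
    t = S * A

    def estimate():
        # Empirical model phi_hat_t: transition kernel and mean rewards.
        p_hat = trans_counts / counts[..., None]
        r_hat = reward_sums / counts
        return p_hat, r_hat

    phi_hat = estimate()
    policy = mdp_solver(phi_hat)

    while not stopping_rule(phi_hat, counts, t) and t < max_steps:
        omega = allocation(phi_hat)                       # eq. (9)
        # Tracking step (stand-in for eq. (11)): sample the pair whose
        # count lags furthest behind its target allocation t * omega.
        s, a = np.unravel_index(np.argmax(t * omega - counts), (S, A))
        s_next, r = mdp_sample(s, a)
        counts[s, a] += 1
        trans_counts[s, a, s_next] += 1
        reward_sums[s, a] += r
        t += 1
        phi_hat = estimate()
        policy = mdp_solver(phi_hat)

    return policy, t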
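The experiment setup quoted in the last row can likewise be read as a small driver loop. The sketch below is an assumption-laden outline, not the paper's script: `random_mdp`, `run_klb_ts`, and `run_bespoke` are hypothetical helpers standing in for the random MDP generation and the two algorithms, and the collected statistics mirror the mean and standard deviation the paper reports in Figure 2.

```python
import numpy as np

def random_mdp(num_states, num_actions, rng):
    """Hypothetical helper: a random MDP (transition kernel + rewards),
    mirroring the paper's 'we generated two MDPs randomly'."""
    p = rng.dirichlet(np.ones(num_states), size=(num_states, num_actions))
    r = rng.uniform(size=(num_states, num_actions))
    return p, r

def compare(run_klb_ts, run_bespoke, deltas, n_runs, num_states, num_actions, seed=0):
    """For each confidence level delta, run both algorithms n_runs times on the
    same randomly generated MDP and record their sample complexities.
    run_klb_ts / run_bespoke are placeholders returning a sample-complexity count."""
    rng = np.random.default_rng(seed)
    mdp = random_mdp(num_states, num_actions, rng)
    results = {}
    for delta in deltas:
        klb = [run_klb_ts(mdp, delta, rng) for _ in range(n_runs)]
        bes = [run_bespoke(mdp, delta, rng) for _ in range(n_runs)]
        results[delta] = {
            "klb_ts_mean": np.mean(klb), "klb_ts_std": np.std(klb),
            "bespoke_mean": np.mean(bes), "bespoke_std": np.std(bes),
        }
    return results
```

Under the paper's setting one would call this with the first MDP (2 states, 2 actions, n_runs = 10) and the second MDP (5 states, 10 actions, n_runs = 5), with `run_bespoke` reduced to returning its initial sample count n_min in the second case.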