Batch-Switching Policy Iteration

Authors: Shivaram Kalyanakrishnan, Utkarsh Mall, Ritish Goyal

IJCAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our main contribution is a bound of O(1.6479^n) on the number of iterations taken by an instance of BSPI. We believe this is the tightest bound shown yet for any variant of PI. We also present experimental results that suggest Howard's PI might itself enjoy an even tighter bound.
Researcher Affiliation | Academia | Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai 400076, India
Pseudocode | Yes | Algorithm BSPI. Input parameter b ∈ {1, 2, . . . , n}. π ← arbitrary policy in Π. Evaluate π; derive IS(π). While IS(π) ≠ ∅: j ← ⌈n/b⌉. While B_j ∩ IS(π) = ∅: j ← j − 1. For i ∈ {1, 2, . . . , n}: If s_i ∈ (B_j ∩ IS(π)) Then π′(s_i) ← (π(s_i))^C Else π′(s_i) ← π(s_i). π ← π′. Evaluate π; derive IS(π). Return π. (A hedged Python sketch of this loop appears after the table.)
Open Source Code | No | The paper does not explicitly state that source code for the described methodology is publicly available, nor does it provide a link to a code repository.
Open Datasets | No | We generate n-state MDPs through a procedure that picks n/5 states, with equal probability, as target states for each state-action pair: transition probabilities are picked uniformly at random from [0, 1] and subsequently scaled to sum to 1. Each reward is a sample drawn from a standard normal distribution.
Dataset Splits | No | The paper describes generating MDP instances and initial policies, but does not specify explicit training, validation, or test dataset splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers.
Experiment Setup | Yes | We generate n-state MDPs through a procedure that picks n/5 states, with equal probability, as target states for each state-action pair: transition probabilities are picked uniformly at random from [0, 1] and subsequently scaled to sum to 1. Each reward is a sample drawn from a standard normal distribution. The discount factor γ is set to 0.99. Each point is an average over 100 randomly-generated MDP instances and initial policies. (A NumPy sketch of this generation recipe appears after the table.)
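
The quoted BSPI pseudocode maps onto a short loop. Below is a minimal Python sketch under our own assumptions: two actions encoded as 0 and 1, batch B_j taken as the contiguous block of b state indices ending at j*b, and hypothetical helpers evaluate and improvable_states standing in for policy evaluation and the computation of IS(π). It is a sketch, not the authors' implementation.

import math

def bspi(n, b, pi, evaluate, improvable_states):
    """Sketch of Batch-Switching Policy Iteration for a 2-action, n-state MDP.

    n                 : number of states, indexed 1..n
    b                 : batch size, 1 <= b <= n
    pi                : initial policy, dict mapping state index -> action in {0, 1}
    evaluate          : hypothetical callable returning the value function of pi
    improvable_states : hypothetical callable returning the set IS(pi)
    """
    values = evaluate(pi)
    IS = improvable_states(pi, values)
    while IS:                                # While IS(pi) is non-empty
        j = math.ceil(n / b)                 # j <- ceil(n / b)
        while not (batch(j, b, n) & IS):     # While B_j contains no improvable state
            j -= 1                           # j <- j - 1
        new_pi = dict(pi)
        for s in batch(j, b, n) & IS:        # switch every improvable state in B_j
            new_pi[s] = 1 - pi[s]            # pi'(s) <- complementary action
        pi = new_pi                          # pi <- pi'
        values = evaluate(pi)
        IS = improvable_states(pi, values)
    return pi

def batch(j, b, n):
    """States in batch B_j under the assumed contiguous-indexing scheme."""
    return set(range((j - 1) * b + 1, min(j * b, n) + 1))

In this sketch, b = n yields a single batch, so every improvable state is switched (Howard's PI), while b = 1 switches only the highest-indexed improvable state on each iteration.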
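
The experiment-setup row describes a concrete generation recipe, sketched below in NumPy. The function name generate_mdp, the choice of two actions per state, the drawing of target states without replacement, and the attachment of one reward per state-action pair are our assumptions; only the n/5 target states, uniform transition weights rescaled to sum to 1, standard-normal rewards, and γ = 0.99 come from the quoted text.

import numpy as np

def generate_mdp(n, num_actions=2, rng=None):
    """Random n-state MDP following the quoted recipe (assumed details noted above)."""
    rng = np.random.default_rng() if rng is None else rng
    n_targets = max(1, n // 5)                     # n/5 target states per (s, a)
    T = np.zeros((n, num_actions, n))              # transition probabilities
    R = rng.standard_normal((n, num_actions))      # rewards drawn from N(0, 1)
    for s in range(n):
        for a in range(num_actions):
            targets = rng.choice(n, size=n_targets, replace=False)
            weights = rng.uniform(0.0, 1.0, size=n_targets)
            T[s, a, targets] = weights / weights.sum()   # scale to sum to 1
    return T, R

# Example: 100 randomly generated 50-state instances with gamma = 0.99, as in the setup.
gamma = 0.99
instances = [generate_mdp(50) for _ in range(100)]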