Batch-Switching Policy Iteration

Authors: Shivaram Kalyanakrishnan, Utkarsh Mall, Ritish Goyal

IJCAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our main contribution is a bound of O(1.6479^n) on the number of iterations taken by an instance of BSPI. We believe this is the tightest bound shown yet for any variant of PI. We also present experimental results that suggest Howard's PI might itself enjoy an even tighter bound.
Researcher Affiliation | Academia | Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai 400076, India
Pseudocode | Yes | Algorithm BSPI. Input parameter b ∈ {1, 2, . . . , n}. π ← arbitrary policy in Π. Evaluate π; derive IS(π). While IS(π) ≠ ∅: j ← ⌈n/b⌉. While B_j ∩ IS(π) = ∅: j ← j − 1. For i ∈ {1, 2, . . . , n}: If s_i ∈ (B_j ∩ IS(π)) Then π′(s_i) ← (π(s_i))^C Else π′(s_i) ← π(s_i). π ← π′. Evaluate π; derive IS(π). Return π. (A hedged Python sketch of this loop appears after the table.)
Open Source Code | No | The paper does not explicitly state that source code for the described methodology is publicly available, nor does it provide a link to a code repository.
Open Datasets | No | We generate n-state MDPs through a procedure that picks n/5 states, with equal probability, as target states for each state-action pair: transition probabilities are picked uniformly at random from [0, 1] and subsequently scaled to sum to 1. Each reward is a sample drawn from a standard normal distribution.
Dataset Splits | No | The paper describes generating MDP instances and initial policies, but does not specify explicit training, validation, or test dataset splits.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers.
Experiment Setup | Yes | We generate n-state MDPs through a procedure that picks n/5 states, with equal probability, as target states for each state-action pair: transition probabilities are picked uniformly at random from [0, 1] and subsequently scaled to sum to 1. Each reward is a sample drawn from a standard normal distribution. The discount factor γ is set to 0.99. Each point is an average over 100 randomly-generated MDP instances and initial policies. (A NumPy sketch of this generation recipe appears after the table.)
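
The quoted BSPI pseudocode maps onto a short loop. Below is a minimal Python sketch under our own assumptions: two actions encoded as 0 and 1, batch B_j taken as the contiguous block of b state indices ending at j*b, and hypothetical helpers evaluate and improvable_states standing in for policy evaluation and the computation of IS(π). It is a sketch, not the authors' implementation.

import math

def bspi(n, b, pi, evaluate, improvable_states):
    """Sketch of Batch-Switching Policy Iteration for a 2-action, n-state MDP.

    n                 : number of states, indexed 1..n
    b                 : batch size, 1 <= b <= n
    pi                : initial policy, dict mapping state index -> action in {0, 1}
    evaluate          : hypothetical callable returning the value function of pi
    improvable_states : hypothetical callable returning the set IS(pi)
    """
    values = evaluate(pi)
    IS = improvable_states(pi, values)
    while IS:                                # While IS(pi) is non-empty
        j = math.ceil(n / b)                 # j <- ceil(n / b)
        while not (batch(j, b, n) & IS):     # While B_j contains no improvable state
            j -= 1                           # j <- j - 1
        new_pi = dict(pi)
        for s in batch(j, b, n) & IS:        # switch every improvable state in B_j
            new_pi[s] = 1 - pi[s]            # pi'(s) <- complementary action
        pi = new_pi                          # pi <- pi'
        values = evaluate(pi)
        IS = improvable_states(pi, values)
    return pi

def batch(j, b, n):
    """States in batch B_j under the assumed contiguous-indexing scheme."""
    return set(range((j - 1) * b + 1, min(j * b, n) + 1))

In this sketch, b = n yields a single batch, so every improvable state is switched (Howard's PI), while b = 1 switches only the highest-indexed improvable state on each iteration.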
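
The experiment-setup row describes a concrete generation recipe, sketched below in NumPy. The function name generate_mdp, the choice of two actions per state, the drawing of target states without replacement, and the attachment of one reward per state-action pair are our assumptions; only the n/5 target states, uniform transition weights rescaled to sum to 1, standard-normal rewards, and γ = 0.99 come from the quoted text.

import numpy as np

def generate_mdp(n, num_actions=2, rng=None):
    """Random n-state MDP following the quoted recipe (assumed details noted above)."""
    rng = np.random.default_rng() if rng is None else rng
    n_targets = max(1, n // 5)                     # n/5 target states per (s, a)
    T = np.zeros((n, num_actions, n))              # transition probabilities
    R = rng.standard_normal((n, num_actions))      # rewards drawn from N(0, 1)
    for s in range(n):
        for a in range(num_actions):
            targets = rng.choice(n, size=n_targets, replace=False)
            weights = rng.uniform(0.0, 1.0, size=n_targets)
            T[s, a, targets] = weights / weights.sum()   # scale to sum to 1
    return T, R

# Example: 100 randomly generated 50-state instances with gamma = 0.99, as in the setup.
gamma = 0.99
instances = [generate_mdp(50) for _ in range(100)]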