Batch-Switching Policy Iteration
Authors: Shivaram Kalyanakrishnan, Utkarsh Mall, Ritish Goyal
IJCAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our main contribution is a bound of O(1.6479^n) on the number of iterations taken by an instance of BSPI. We believe this is the tightest bound shown yet for any variant of PI. We also present experimental results that suggest Howard's PI might itself enjoy an even tighter bound. |
| Researcher Affiliation | Academia | Department of Computer Science and Engineering, Indian Institute of Technology Bombay, Mumbai 400076, India |
| Pseudocode | Yes | Algorithm BSPI. Input: parameter b ∈ {1, 2, . . . , n}. π ← arbitrary policy in Π. Evaluate π; derive IS(π). While IS(π) ≠ ∅: j ← ⌈n/b⌉; while B_j ∩ IS(π) = ∅, j ← j − 1; for i ∈ {1, 2, . . . , n}: if s_i ∈ (B_j ∩ IS(π)) then π′(s_i) ← (π(s_i))^C else π′(s_i) ← π(s_i); π ← π′; evaluate π; derive IS(π). Return π. (A runnable sketch of this loop is given after the table.) |
| Open Source Code | No | The paper does not explicitly state that source code for the described methodology is publicly available, nor does it provide a link to a code repository. |
| Open Datasets | No | We generate n-state MDPs through a procedure that picks n/5 states, with equal probability, as target states for each state-action pair: transition probabilities are picked uniformly at random from [0, 1] and subsequently scaled to sum to 1. Each reward is a sample drawn from a standard normal distribution. |
| Dataset Splits | No | The paper describes generating MDP instances and initial policies, but does not specify explicit training, validation, or test dataset splits. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers. |
| Experiment Setup | Yes | We generate n-state MDPs through a procedure that picks n/5 states, with equal probability, as target states for each state-action pair: transition probabilities are picked uniformly at random from [0, 1] and subsequently scaled to sum to 1. Each reward is a sample drawn from a standard normal distribution. The discount factor γ is set to 0.99. Each point is an average over 100 randomly-generated MDP instances and initial policies. |
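As a concrete illustration of the pseudocode quoted above, here is a minimal NumPy sketch of the BSPI loop for a 2-action MDP. It assumes 0-indexed states, batches of b consecutive states, and a small tolerance when testing improvability; the function names, array layout, and tolerance are our own and not taken from the paper.

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma):
    """Solve V = R_pi + gamma * P_pi V for the value of policy pi."""
    n = len(pi)
    P_pi = P[np.arange(n), pi]          # (n, n) transition matrix under pi
    R_pi = R[np.arange(n), pi]          # (n,) expected reward under pi
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R_pi)

def improvable_states(P, R, pi, V, gamma, tol=1e-10):
    """States where the other action has a strictly higher Q-value (2-action MDP)."""
    n = len(pi)
    Q = R + gamma * P @ V               # (n, 2) action values
    other = 1 - pi                      # complement action in a 2-action MDP
    return set(np.where(Q[np.arange(n), other] > Q[np.arange(n), pi] + tol)[0])

def bspi(P, R, gamma, b, pi0):
    """Batch-Switching Policy Iteration sketch for a 2-action MDP.

    P: (n, 2, n) transition probabilities, R: (n, 2) rewards,
    b: batch size, pi0: initial policy as an array of 0/1 actions.
    Batch B_j holds the states (j-1)*b .. j*b - 1 (0-indexed here).
    """
    n = len(pi0)
    pi = np.array(pi0, dtype=int)
    V = evaluate_policy(P, R, pi, gamma)
    IS = improvable_states(P, R, pi, V, gamma)
    iterations = 0
    while IS:
        # Highest-index batch that contains at least one improvable state.
        j = int(np.ceil(n / b))
        while not (set(range((j - 1) * b, min(j * b, n))) & IS):
            j -= 1
        switch = set(range((j - 1) * b, min(j * b, n))) & IS
        for i in switch:
            pi[i] = 1 - pi[i]           # switch to the complement action
        V = evaluate_policy(P, R, pi, gamma)
        IS = improvable_states(P, R, pi, V, gamma)
        iterations += 1
    return pi, iterations
```

The batch-selection rule mirrors the quoted pseudocode: the highest-index batch containing an improvable state is chosen, and only the improvable states inside that batch are switched.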
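The experiment setup quoted in the table can likewise be turned into code. The sketch below generates random MDPs in the spirit of the quoted procedure (n/5 target states per state-action pair, U[0, 1] transition weights scaled to sum to 1, standard-normal rewards) and runs the `bspi` sketch above with γ = 0.99. The choice of 2 actions, the instance size, and the number of runs are illustrative assumptions, not values confirmed by the paper beyond what is quoted.

```python
import numpy as np

def random_mdp(n, num_actions=2, rng=None):
    """Randomly generated n-state MDP in the style of the quoted setup.

    For each state-action pair, n // 5 target states are chosen uniformly at
    random; their transition probabilities are drawn from U[0, 1] and scaled
    to sum to 1.  Rewards are standard normal.  (num_actions=2 is an
    assumption; the paper's bound concerns 2-action MDPs.)
    """
    rng = np.random.default_rng() if rng is None else rng
    k = max(1, n // 5)                       # number of target states per pair
    P = np.zeros((n, num_actions, n))
    R = rng.standard_normal((n, num_actions))
    for s in range(n):
        for a in range(num_actions):
            targets = rng.choice(n, size=k, replace=False)
            weights = rng.uniform(0.0, 1.0, size=k)
            P[s, a, targets] = weights / weights.sum()
    return P, R

# Illustrative driver mirroring the reported setup: gamma = 0.99, random
# initial policies, iteration counts averaged over instances (the paper
# averages over 100 randomly-generated MDPs; 10 are used here for brevity).
# `bspi` refers to the sketch above.
if __name__ == "__main__":
    gamma, n, b = 0.99, 20, 2
    rng = np.random.default_rng(0)
    counts = []
    for _ in range(10):
        P, R = random_mdp(n, rng=rng)
        pi0 = rng.integers(0, 2, size=n)
        _, iters = bspi(P, R, gamma, b, pi0)
        counts.append(iters)
    print("mean iterations:", np.mean(counts))
```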