Safe Policy Improvement with Baseline Bootstrapping in Factored Environments
Authors: Thiago D. Simão, Matthijs T. J. Spaan
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The empirical analysis shows that our method can improve the policy using a number of samples potentially one order of magnitude smaller than the flat algorithm. ... We evaluate the proposed factored approaches for the SPI problem focusing on their sample efficiency and generalization capability. All algorithms use a flat representation to estimate the transition function, as in the Πb-SPIBB algorithm, and flat Value Iteration with a discount factor of 0.99 to compute the new policy. ... Figure 1 shows the results obtained. |
| Researcher Affiliation | Academia | Thiago D. Simão, Matthijs T. J. Spaan Delft University of Technology, The Netherlands {t.diassimao, m.t.j.spaan}@tudelft.nl |
| Pseudocode | Yes | Algorithm 1 Policy-based SPIBB (Πb-SPIBB). Input: Previous experiences D; Parameters ϵ, δ; Behavior policy πb. Output: Safe Policy. 1: Estimate T̂ 2: Compute Bm = Km 3: Compute Πb (Equation 6) 4: return argmaxπ∈Πb V(π, M̂) and Algorithm 2 Factored Πb-SPIBB. Input: Previous experiences D; Parameters ϵ, δ; Behavior policy πb; Dependency function D. Output: Safe Policy. 1: Estimate P̂(x′j | xD(j), a) for each feature j, with T̂ = ∏j P̂ (Equation 2) 2: Compute B̃m = K̃m 3: Compute Πb (Equation 6) 4: return argmaxπ∈Πb V(π, M̂) |
| Open Source Code | No | The paper does not provide any specific repository link, explicit code release statement, or mention code in supplementary materials for the methodology described. |
| Open Datasets | Yes | We use two domains with known independence between features: i) the Taxi domain (Dietterich 1998) that has 4 conditionally independent features, 500 states, 6 actions and a horizon of 200 steps, and ii) the SysAdmin domain with 8 machines in a bidirectional ring topology (Guestrin et al. 2003), that has 256 states, 9 actions and a horizon of 40. |
| Dataset Splits | No | The paper describes generating a 'batch of past experiences D' and then evaluating policies through '1000 simulations', but it does not specify explicit train/validation/test dataset splits, sample counts for splits, or cross-validation details for data partitioning. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | All algorithms use a flat representation to estimate the transition function, as in the Πb-SPIBB algorithm, and flat Value Iteration with a discount factor of 0.99 to compute the new policy. ... For the Taxi domain we set m = 10 and mi = 20 for 0 < i ≤ \|X\|. In the case of the SysAdmin problem we set m = 50 and mi = 10 for 0 < i ≤ \|X\|. |
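
To make the Πb-SPIBB pseudocode quoted in the table concrete, here is a minimal Python sketch of the constrained policy-improvement step that lines 3 and 4 of Algorithm 1 rely on. This is an illustrative reading of the SPIBB constraint, not code released with the paper; the array names (`Q`, `pi_b`, `bootstrapped`) and the greedy tie-breaking are assumptions.

```python
import numpy as np

def spibb_policy_improvement(Q, pi_b, bootstrapped):
    """One Pi_b-SPIBB improvement step (sketch, not the authors' code).

    Q            : (S, A) action-value estimates on the estimated MDP M-hat.
    pi_b         : (S, A) behavior-policy probabilities.
    bootstrapped : (S, A) boolean mask, True where (s, a) has fewer than
                   m samples and the baseline must be copied (the set Bm).
    """
    S, A = Q.shape
    pi = np.zeros((S, A))
    for s in range(S):
        boot = bootstrapped[s]
        # On bootstrapped pairs, keep the baseline's probability mass.
        pi[s, boot] = pi_b[s, boot]
        free_mass = 1.0 - pi[s, boot].sum()
        if (~boot).any():
            # Move the remaining mass to the greedy well-sampled action.
            q = np.where(boot, -np.inf, Q[s])
            pi[s, int(np.argmax(q))] += free_mass
        else:
            # Every action is bootstrapped: fall back to the baseline.
            pi[s] = pi_b[s]
    return pi
```

Alternating policy evaluation on M̂ with this improvement step keeps the policy inside the constrained set Πb, which is how the argmaxπ∈Πb V(π, M̂) returned by both algorithms can be computed in practice.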
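Algorithm 2 changes only the model-estimation side: each feature's transition distribution is estimated from its parents (the dependency function D), and a state-action pair joins the bootstrap set as soon as any of its parent configurations is under-sampled relative to the per-feature thresholds mi quoted in the Experiment Setup row. A hedged Python sketch under assumed data structures (per-feature count tables keyed by parent configuration; none of these names come from the paper):

```python
import numpy as np
from collections import defaultdict

def factored_model(transitions, parents, n_values, m_i):
    """Per-feature transition estimates and a bootstrap test (sketch).

    transitions : iterable of (x, a, x_next), with x, x_next feature tuples.
    parents     : parents[j] = indices of the features feature j depends on.
    n_values    : n_values[j] = number of values feature j can take.
    m_i         : m_i[j] = sample threshold for feature j's local estimates.
    """
    n_feats = len(n_values)
    # counts[j][(parent_config, a)] = histogram over feature j's next values.
    counts = [defaultdict(lambda n=n_values[j]: np.zeros(n))
              for j in range(n_feats)]
    for x, a, x_next in transitions:
        for j in range(n_feats):
            key = (tuple(x[k] for k in parents[j]), a)
            counts[j][key][x_next[j]] += 1

    def p_hat(j, x, a, xj_next):
        # Local estimate of P(x'_j | parents, a); the uniform fallback for
        # unseen configurations is an assumption, not from the paper.
        c = counts[j][(tuple(x[k] for k in parents[j]), a)]
        return c[xj_next] / c.sum() if c.sum() > 0 else 1.0 / n_values[j]

    def is_bootstrapped(x, a):
        # (x, a) is in the factored bootstrap set if any parent
        # configuration it induces has fewer than m_i[j] samples.
        return any(counts[j][(tuple(x[k] for k in parents[j]), a)].sum() < m_i[j]
                   for j in range(n_feats))

    return p_hat, is_bootstrapped
```

The full transition estimate is then the product over features, T̂(x′ | x, a) = ∏j P̂(x′j | xD(j), a), the product form that Equation 2 in the paper appears to define. Because each factor conditions only on a few parents, its counts fill up far faster than a flat |S| × |A| table, which is the source of the order-of-magnitude sample savings reported in the Research Type row.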