Structure Learning for Safe Policy Improvement

Authors: Thiago D. Simão, Matthijs T. J. Spaan

IJCAI 2019

Reproducibility Variable Result LLM Response
Research Type Experimental Section 5 (Empirical Analysis): We evaluate the Structure Learning Πb-SPIBB framework combined with the two structure learning algorithms presented before (SL and k-meteorologists) in three domains. Figures 1 and 2 present the results. In every plot the x-axis shows the number of trials in the batch collected with the behavior policy.
Researcher Affiliation Academia Thiago D. Simão and Matthijs T. J. Spaan, Delft University of Technology, The Netherlands, {t.diassimao, m.t.j.spaan}@tudelft.nl
Pseudocode Yes Algorithm 1 Policy-based SPIBB (Πb-SPIBB), Algorithm 2 Factored Πb-SPIBB, Algorithm 3 Structure Learning Πb-SPIBB
Open Source Code No The paper does not provide any explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository or mention code in supplementary materials.
Open Datasets Yes The problems used are: (i) the Taxi domain with a horizon of 200 steps [Dietterich, 1998], (ii) the SysAdmin domain with 9 machines in a bidirectional ring topology and a horizon of 40 steps [Guestrin et al., 2003], and (iii) the Stock-Trading domain with 3 sectors and 2 stocks per sector with a horizon of 40 steps [Strehl et al., 2007].
Dataset Splits No The paper refers to a 'batch D of previous experiences' and 'estimating the performance' but does not specify explicit training, validation, or test dataset splits with percentages, sample counts, or defined methodologies for partitioning the data.
Hardware Specification No The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used to conduct the experiments.
Software Dependencies No The paper does not specify any software dependencies with their version numbers that would be required to replicate the experiments (e.g., Python, PyTorch, TensorFlow versions, or specific library versions).
Experiment Setup Yes Table 1 reports the parameters used by each algorithm. These values were chosen in order to reduce the number of samples required to improve the policy, while keeping a safe behavior. All algorithms use a flat estimate of the transition function and a flat Value Iteration algorithm with a discount factor of 0.99. The softmax temperature is set to 2 for the Taxi and Stock-Trading domains and to 3 for the SysAdmin domain.
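
The setup quoted above (flat Value Iteration with a discount factor of 0.99, plus a softmax temperature) can be illustrated with a minimal sketch. It is not the authors' code, which was not released. The summary does not say how the softmax is applied, so the sketch assumes the temperature divides the Q-values to produce a stochastic policy. The function names `value_iteration` and `softmax_policy`, the tolerance, and the two-state toy MDP are all hypothetical and chosen only for illustration.

```python
import numpy as np


def value_iteration(P, R, gamma=0.99, tol=1e-8, max_iter=10_000):
    """Flat (tabular) value iteration.

    P: array of shape (S, A, S) with transition probabilities.
    R: array of shape (S, A) with expected immediate rewards.
    Returns the Q-values as an array of shape (S, A).
    """
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        V = Q.max(axis=1)                # greedy state values
        Q_new = R + gamma * (P @ V)      # Bellman optimality backup
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q


def softmax_policy(Q, temperature):
    """Softmax over actions. ASSUMPTION: the temperature divides the Q-values."""
    z = Q / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for numerical stability
    w = np.exp(z)
    return w / w.sum(axis=1, keepdims=True)


# Hypothetical toy MDP: state 1 is absorbing and pays reward 1 per step;
# in state 0, action 0 stays put and action 1 moves to state 1.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, :, 1] = 1.0
R = np.array([[0.0, 0.0],
              [1.0, 1.0]])

Q = value_iteration(P, R, gamma=0.99)
pi = softmax_policy(Q, temperature=2.0)  # 2 is the value quoted for Taxi and Stock-Trading
```

A higher temperature flattens the distribution toward uniform, so under this reading the value of 3 quoted for SysAdmin would give a more exploratory policy than the value of 2, for the same Q-values.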