Structure Learning for Safe Policy Improvement
Authors: Thiago D. Simão, Matthijs T. J. Spaan
IJCAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5, Empirical Analysis: We evaluate the Structure Learning Πb-SPIBB framework combined with the two structure learning algorithms presented before (SL and k-meteorologists) in three domains. Figures 1 and 2 present the results. In every plot the x-axis shows the number of trials in the batch collected with the behavior policy. |
| Researcher Affiliation | Academia | Thiago D. Simão and Matthijs T. J. Spaan, Delft University of Technology, The Netherlands {t.diassimao, m.t.j.spaan}@tudelft.nl |
| Pseudocode | Yes | Algorithm 1 Policy-based SPIBB (Πb-SPIBB), Algorithm 2 Factored Πb-SPIBB, Algorithm 3 Structure Learning Πb-SPIBB |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository or mention code in supplementary materials. |
| Open Datasets | Yes | The problems used are: (i) the Taxi domain with a horizon of 200 steps [Dietterich, 1998], (ii) the SysAdmin domain with 9 machines in a bidirectional ring topology and a horizon of 40 steps [Guestrin et al., 2003], and (iii) the Stock-Trading domain with 3 sectors and 2 stocks per sector with a horizon of 40 steps [Strehl et al., 2007]. |
| Dataset Splits | No | The paper refers to a 'batch D of previous experiences' and 'estimating the performance' but does not specify explicit training, validation, or test dataset splits with percentages, sample counts, or defined methodologies for partitioning the data. |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cloud instances) used to conduct the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with their version numbers that would be required to replicate the experiments (e.g., Python, PyTorch, TensorFlow versions, or specific library versions). |
| Experiment Setup | Yes | Table 1 reports the parameters used by each algorithm. These values were chosen in order to reduce the number of samples required to improve the policy, while keeping a safe behavior. All algorithms use a flat estimate of the transition function and a flat Value Iteration algorithm with a discount factor of 0.99. The softmax temperature is set to 2 for the Taxi and Stock-Trading domains and to 3 for the SysAdmin domain. |
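
The experiment setup row mentions flat Value Iteration with a discount factor of 0.99 and a softmax temperature. The following is a minimal sketch of those two ingredients, not the authors' code (the paper releases none). It assumes a tabular MDP and a softmax of the form exp(Q / temperature) over the Q-values, which is one common convention; the paper excerpt above does not confirm the exact form. The function names, the data layout, and the two-state toy MDP are hypothetical and exist only for illustration.

```python
import math


def value_iteration(P, R, gamma=0.99, tol=1e-8, max_iters=100_000):
    """Flat (tabular) value iteration with in-place state updates.

    P[s][a] is a list of (next_state, probability) pairs.
    R[s][a] is the expected immediate reward.
    Returns the Q-values as a list of per-state lists.
    """
    n_states = len(P)
    V = [0.0] * n_states
    Q = [[0.0] * len(P[s]) for s in range(n_states)]
    for _ in range(max_iters):
        delta = 0.0
        for s in range(n_states):
            for a in range(len(P[s])):
                Q[s][a] = R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
            new_v = max(Q[s])
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:  # stop once the largest value change is negligible
            break
    return Q


def softmax_policy(q_values, temperature):
    """Action probabilities proportional to exp(q / temperature) (assumed form)."""
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical two-state, two-action MDP (not one of the paper's domains).
P = [
    [[(0, 1.0)], [(1, 1.0)]],  # state 0: action 0 stays, action 1 moves to state 1
    [[(1, 1.0)], [(0, 1.0)]],  # state 1: action 0 stays, action 1 moves to state 0
]
R = [
    [0.0, 1.0],
    [2.0, 0.0],
]

Q = value_iteration(P, R, gamma=0.99)
pi_state0 = softmax_policy(Q[0], temperature=2.0)
```

In this sketch a higher temperature flattens the action distribution, so a value of 3 yields a more exploratory policy than a value of 2 for the same Q-values.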