reproducibilityindex.ai

Learning Bayesian Networks with Thousands of Variables

Authors: Mauro Scanagatta, Cassio P. de Campos, Giorgio Corani, Marco Zaffalon

NeurIPS 2015

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We test our approach on data sets containing up to ten thousand variables. As a performance indicator we consider the score of the network found. Our parent set identification approach outperforms consistently the usual approach of setting the maximum in-degree and then computing the score of all parent sets. Our structure optimization approach outperforms Gobnilp when learning with more than 500 nodes. All the software and data sets used in the experiments are available online.
Researcher Affiliation	Academia	Mauro Scanagatta, IDSIA, SUPSI, USI, Lugano, Switzerland, mauro@idsia.ch; Cassio P. de Campos, Queen's University Belfast, Northern Ireland, UK, c.decampos@qub.ac.uk; Giorgio Corani, IDSIA, SUPSI, USI, Lugano, Switzerland, giorgio@idsia.ch; Marco Zaffalon, IDSIA, Lugano, Switzerland, zaffalon@idsia.ch
Pseudocode	Yes	1. Build and keep a Boolean square matrix m to mark which are the descendants of nodes (m(X, Y) tells whether Y is descendant of X). Start it all false. 2. For each node Vj in the order, with j = n, ..., 1:
Open Source Code	Yes	All the software and data sets used in the experiments are available online.
Open Datasets	Yes	We consider 16 data sets already used in the literature of structure learning, firstly introduced in [13] and [8]. We take the largest networks available in the literature: andes (n=223), diabetes (n=413), pigs (n=441), link (n=724), munin (n=1041). Additionally we randomly generate other 15 networks: five networks of size 2000, five networks of size 4000, five networks of size 10000. Each variable has a number of states randomly drawn from 2 to 4 and a number of parents randomly drawn from 0 to 6. Overall we consider 20 networks. From each network we sample a data set of 5000 instances.
Dataset Splits	No	We randomly split each data set into three subsets of instances. No split percentages or instance counts are given, so the exact splits cannot be reproduced.
Hardware Specification	No	For a given data set the computation is performed on the same machine. (No specific hardware details are provided).
Software Dependencies	No	The largest data set analyzed in [1] with the Gobnilp software contains 413 variables. In [5] Gobnilp is used for structural learning with 1614 variables. No version numbers are provided for any software.
Experiment Setup	Yes	We allow one minute per variable to each approach for parent set identiﬁcation. We set the maximum in-degree to k = 6, a high value that allows learning even complex structures.
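
The setup quoted above fixes the maximum in-degree at k = 6, and the Research Type row describes the usual approach as scoring all parent sets up to that in-degree. As a rough illustration of why that becomes costly at the network sizes listed under Open Datasets, the sketch below counts the candidate parent sets for a single variable, i.e. all subsets of the other n - 1 variables with at most k elements. The function name `count_parent_sets` is hypothetical and not part of the authors' software; the count is plain combinatorics and says nothing about how the paper's method prunes or orders candidates.

```python
from math import comb


def count_parent_sets(n_vars: int, max_in_degree: int) -> int:
    """Count candidate parent sets for one variable.

    A parent set is any subset of the other (n_vars - 1) variables
    with at most max_in_degree elements, including the empty set.
    """
    return sum(comb(n_vars - 1, size) for size in range(max_in_degree + 1))


if __name__ == "__main__":
    # Network sizes taken from the Open Datasets row; k = 6 from the setup row.
    for n in (223, 1041, 10000):
        print(f"n={n}: {count_parent_sets(n, 6)} candidate parent sets per variable")
```

Under the quoted budget of one minute per variable, an exhaustive scorer would have to evaluate this many sets within 60 seconds for each variable, which gives a sense of scale for the larger networks.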