Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Improving Markov Network Structure Learning Using Decision Trees
Authors: Daniel Lowd, Jesse Davis
JMLR 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In an extensive empirical evaluation on 20 data sets, DTSL is comparable to L1 and significantly faster and more accurate than two other baselines. DT-BLM is slower than DTSL, but obtains slightly higher accuracy. DT+L1 combines the strengths of DTSL and L1 to perform significantly better than either of them with only a modest increase in training time. |
| Researcher Affiliation | Academia | Daniel Lowd EMAIL Department of Computer and Information Science University of Oregon Eugene, OR 97403, USA Jesse Davis EMAIL Department of Computer Science Katholieke Universiteit Leuven 3001 Heverlee, Belgium |
| Pseudocode | Yes | Algorithm 1 The DTSL Algorithm: function DTSL(training examples D, variables X): F ← ∅; for all Xi ∈ X do: Ti ← LearnTree(D, Xi); Fi ← GenerateFeatures(Ti); F ← F ∪ Fi; end for; M ← LearnWeights(F, D); return M |
| Open Source Code | Yes | All of our code is available at http://ix.cs.uoregon.edu/~lowd/dtsl under a modified BSD license. |
| Open Datasets | Yes | We evaluate our algorithms on 20 real-world data sets. The goals of our experiments are three-fold. First, we want to determine how the different feature generation methods affect the performance of DTSL and DT-BLM (Section 7.3). Second, we want to compare the accuracy of DTSL, DT-BLM, and DT+L1 to each other as well as to several state-of-the-art Markov network structure learners: the algorithm of Della Pietra et al. (1997), which we refer to as DP; BLM (Davis and Domingos, 2010); and L1-regularized logistic regression (Ravikumar et al., 2010) (Section 7.4). Finally, we want to compare the running time of these learning algorithms, since this greatly affects their practical utility (Section 7.5). ... These data sets are publicly available at http://alchemy.cs.washington.edu/papers/davis10a. |
| Dataset Splits | Yes | The text domains contained roughly a 50-50 train-test split, whereas all other domains used around 75% of the data for the training, 10% for tuning, and 15% for testing. Thus we split the test set of these domains to make the proportion of data devoted to each task more closely match the other domains used in the empirical evaluation. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used to run the experiments, such as CPU models, GPU models, or memory details. |
| Software Dependencies | No | DTSL was implemented in OCaml. For both BLM and DP, we used the publicly available code of Davis and Domingos (2010). ... For Ravikumar et al.’s approach, we tried both the OWL-QN (Andrew and Gao, 2007) and LIBLINEAR (Fan et al., 2008) software packages. ... We computed the conditional marginal probabilities using Gibbs sampling, as implemented in the open-source Libra toolkit. The paper names the software tools and languages used, but provides no version numbers for them or for any supporting libraries, which would be needed for exact reproduction. |
| Experiment Setup | Yes | For DTSL, we selected the structure prior κ for each domain that maximized the total log-likelihood of all probabilistic decision trees on the validation set. The values of κ we used were powers of 10, ranging from 0.0001 to 1.0. When learning the weights for each feature generation method, we placed a Gaussian prior with mean 0 on each feature weight and then tuned the standard deviation to maximize PLL on the validation set, with values of 100, 10, 1, and 0.1. ... For L1, on each data set we tried the following values of the LIBLINEAR tuning parameter C: 0.001, 0.01, 0.05, 0.1, 0.5, 1 and 5. ... For all domains, we ran 10 independent chains, each with 100 burn-in samples followed by 1,000 samples for computing the probability. |
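The pseudocode row above summarizes Algorithm 1 of the paper: learn one decision tree per variable, convert each tree into conjunctive features, take the union, and learn weights. The following Python sketch illustrates only the feature-generation and union steps on toy binary trees; all names and the tree encoding are illustrative assumptions, not the authors' OCaml implementation.

```python
def generate_features(tree, target, path=()):
    """Turn each root-to-leaf path of a per-variable decision tree into
    conjunctive features (illustrating DTSL's Generate Features step).

    A tree is a dict: internal nodes have 'split' (the variable index)
    plus 'lo'/'hi' subtrees; leaves lack 'split'. Each feature is a tuple
    of (variable, value) literals: the path conditions plus the target.
    """
    if 'split' not in tree:  # leaf: emit one feature per target value
        return [path + ((target, v),) for v in (0, 1)]
    j = tree['split']
    feats = generate_features(tree['lo'], target, path + ((j, 0),))
    feats += generate_features(tree['hi'], target, path + ((j, 1),))
    return feats

def dtsl_features(trees):
    """Union of features from all per-variable trees (the F <- F U Fi loop)."""
    all_feats = set()
    for target, tree in trees.items():
        all_feats.update(generate_features(tree, target))
    return all_feats

# Toy example: X0's tree splits on X1; X1's tree is a single leaf.
trees = {
    0: {'split': 1, 'lo': {}, 'hi': {}},
    1: {},
}
feats = dtsl_features(trees)
# X0's tree yields 4 path features, X1's yields 2, so 6 in total.
```

The remaining step, Learn Weights, would fit these features to the data by optimizing pseudo-likelihood with the Gaussian prior described in the Experiment Setup row; that step is omitted here since it depends on an optimizer not specified in this report.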