Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Spatial Multivariate Trees for Big Data Bayesian Regression

Authors: Michele Peruzzi, David B. Dunson

JMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In addition to simulated data examples, we illustrate SpamTrees using a large climate data set which combines satellite data with land-based station data. Software and source code are available on CRAN at https://CRAN.R-project.org/package=spamtree.
Researcher Affiliation	Academia	Michele Peruzzi EMAIL David B. Dunson EMAIL Department of Statistical Science Duke University Durham, NC 27708-0251, USA
Pseudocode	Yes	Algorithm 1: Computing $p(w \mid \theta)$. Input: $C[j]$ for all $j$ from Algorithm 1; $W^{e} = \bigcup_{r \text{ even}} V_r$; $W^{o} = \bigcup_{r \text{ odd}} V_r$; for $i \in \{e, o\}$ do for $j : \{v_j \in W^{i}\}$ do // [parallel for] Sample $w_j \sim N(\mu_j, \Sigma_j)$ using (17); Let $\mathrm{Pa}[v_j] = \{v_p\}$, then $m^{(c)}_p = H_j^{\top} R_j^{-1} w_j$ and $F^{(c)}_p = H_j^{\top} R_j^{-1} H_j$; Result: sample from $p(w_j \mid w_{-j}, y, \beta, \theta, \tau)$ for all $v_j \in V$. Algorithm 2: Sampling from the full conditional distribution of $w_i$ when $\delta = 1$.
Open Source Code	Yes	Software and source code are available on CRAN at https://CRAN.R-project.org/package=spamtree.
Open Datasets	Yes	In addition to simulated data examples, we illustrate SpamTrees using a large climate data set which combines satellite data with land-based station data.
Dataset Splits	Yes	From the resulting data set of size n = 1,027,562 we remove all observations in a large 3 × 3 degree area in the central U.S. (from -100W to -97W and from 35N to 38N, i.e. the red area of Figure 5) to build a test set on which we calculate coverage and RMSE of the predictions.
Hardware Specification	Yes	Multivariate models were run on an AMD Epyc 7452-based virtual machine with 256GB of memory in the Microsoft Azure cloud; the SpamTree R package was set to run on 20 CPU threads, on R version 4.0.3 linked to the Intel Math Kernel Library (MKL) version 2019.5-075. The univariate models were run on an AMD Ryzen 5950X-based dedicated server with 128GB of memory, on 16 threads, R version 4.1.1 linked to Intel MKL 2019.5-075.
Software Dependencies	Yes	The spamtree package is written in C++ using the Armadillo library for linear algebra (Sanderson and Curtin, 2016) interfaced to R via RcppArmadillo (Eddelbuettel and Sanderson, 2014). All matrix operations are performed efficiently by linkage to the LAPACK and BLAS libraries (Blackford et al., 2002; Anderson et al., 1999) as implemented in OpenBLAS 0.3.10 (Zhang, 2020) or the Intel Math Kernel Library. Multithreaded operations proceed via OpenMP (Dagum and Menon, 1998).
Experiment Setup	Yes	We simulate data from model (15), setting $\beta = 0$, $Z = I_q$ and take the measurement locations on a regular grid of size $70 \times 70$ for a total of 4,900 spatial locations. We simulate the bivariate spatial field by sampling from the full GP using (18) as cross-covariance function; the nuggets for the two outcomes are set to $\tau_1^2 = 0.01$ and $\tau_2^2 = 0.1$. For $j = 1, 2$ we fix $\sigma_{j2} = 1$, $\alpha = 1$, $\beta = 1$ and independently sample $\sigma_{j1} \sim U(-3, 3)$, $\phi_j \sim U(0.1, 3)$, $\phi \sim U(0.1, 30)$, $\delta_{12} \sim \mathrm{Exp}(1)$, generating a total of 500 bivariate data sets. This setup leads to empirical spatial correlations between the two outcomes smaller than 0.25, between 0.25 and 0.75, and larger than 0.75 in absolute value in 107, 330, and 63 of the 500 data sets, respectively.
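
Illustration (not from the paper): the Dataset Splits entry above describes holding out every observation inside a longitude/latitude box and then scoring predictions on that block by RMSE and interval coverage. The following is a minimal Python sketch of that kind of spatial hold-out, not the authors' code (their implementation is the R package spamtree). The function names, the dictionary record layout with "lon" and "lat" keys, and the choice to treat the box edges as inclusive are all assumptions made for this example.

```python
import math

# Hypothetical helper: True when a point lies inside the held-out box.
# Box limits follow the quoted split (-100 to -97 longitude, 35 to 38 latitude);
# treating the edges as inclusive is an assumption of this sketch.
def in_holdout_box(lon, lat, lon_range=(-100.0, -97.0), lat_range=(35.0, 38.0)):
    return (lon_range[0] <= lon <= lon_range[1]
            and lat_range[0] <= lat <= lat_range[1])

# Hypothetical helper: split records (dicts with "lon" and "lat") into
# a training list (outside the box) and a test list (inside the box).
def split_train_test(records):
    train, test = [], []
    for rec in records:
        if in_holdout_box(rec["lon"], rec["lat"]):
            test.append(rec)
        else:
            train.append(rec)
    return train, test

# Root mean squared error between observed and predicted values.
def rmse(y_true, y_pred):
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Fraction of observed values that fall inside their predictive interval.
def coverage(y_true, lower, upper):
    hits = [lo <= t <= up for t, lo, up in zip(y_true, lower, upper)]
    return sum(hits) / len(hits)
```

A spatially contiguous hold-out like this tests prediction far from any training data, which is a harder setting than a random split where every test point has nearby training neighbours.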