Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Spatial Multivariate Trees for Big Data Bayesian Regression

Authors: Michele Peruzzi, David B. Dunson

JMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental In addition to simulated data examples, we illustrate SpamTrees using a large climate data set which combines satellite data with land-based station data. Software and source code are available on CRAN at https://CRAN.R-project.org/package=spamtree.
Researcher Affiliation Academia Michele Peruzzi (EMAIL), David B. Dunson (EMAIL); Department of Statistical Science, Duke University, Durham, NC 27708-0251, USA
Pseudocode Yes Algorithm 1: Computing $p(w \mid \theta)$. Input: $C[j]$ for all $j$ from Algorithm 1; $W_e = \bigcup_{r \text{ even}} V_r$; $W_o = \bigcup_{r \text{ odd}} V_r$; for $i \in \{e, o\}$ do for $j : \{v_j \in W_i\}$ do // [parallel for] Sample $w_j \sim N(\mu_j, \Sigma_j)$ using (17); Let $\mathrm{Pa}[v_j] = \{v_p\}$, then $m_p^{(c)} = H_j^\top R_j^{-1} w_j$ and $F_p^{(c)} = H_j^\top R_j^{-1} H_j$; Result: sample from $p(w_j \mid w_{-j}, y, \beta, \theta, \tau)$ for all $v_j \in V$. Algorithm 2: Sampling from the full conditional distribution of $w_i$ when $\delta = 1$.
Open Source Code Yes Software and source code are available on CRAN at https://CRAN.R-project.org/package=spamtree.
Open Datasets Yes In addition to simulated data examples, we illustrate SpamTrees using a large climate data set which combines satellite data with land-based station data.
Dataset Splits Yes From the resulting data set of size $n = 1{,}027{,}562$ we remove all observations in a large $3 \times 3$ degree area in the central U.S. (from -100W to -97W and from 35N to 38N, i.e. the red area of Figure 5) to build a test set on which we calculate coverage and RMSE of the predictions.
Hardware Specification Yes Multivariate models were run on an AMD Epyc 7452-based virtual machine with 256GB of memory in the Microsoft Azure cloud; the SpamTree R package was set to run on 20 CPU threads, on R version 4.0.3 linked to the Intel Math Kernel Library (MKL) version 2019.5-075. The univariate models were run on an AMD Ryzen 5950X-based dedicated server with 128GB of memory, on 16 threads, R version 4.1.1 linked to Intel MKL 2019.5-075.
Software Dependencies Yes The spamtree package is written in C++ using the Armadillo library for linear algebra (Sanderson and Curtin, 2016) interfaced to R via RcppArmadillo (Eddelbuettel and Sanderson, 2014). All matrix operations are performed efficiently by linkage to the LAPACK and BLAS libraries (Blackford et al., 2002; Anderson et al., 1999) as implemented in OpenBLAS 0.3.10 (Zhang, 2020) or the Intel Math Kernel Library. Multithreaded operations proceed via OpenMP (Dagum and Menon, 1998).
Experiment Setup Yes We simulate data from model (15), setting $\beta = 0$, $Z = I_q$ and take the measurement locations on a regular grid of size $70 \times 70$ for a total of 4,900 spatial locations. We simulate the bivariate spatial field by sampling from the full GP using (18) as cross-covariance function; the nuggets for the two outcomes are set to $\tau_1^2 = 0.01$ and $\tau_2^2 = 0.1$. For $j = 1, 2$ we fix $\sigma_{j2} = 1$, $\alpha = 1$, $\beta = 1$ and independently sample $\sigma_{j1} \sim U(-3, 3)$, $\phi_j \sim U(0.1, 3)$, $\phi \sim U(0.1, 30)$, $\delta_{12} \sim \mathrm{Exp}(1)$, generating a total of 500 bivariate data sets. This setup leads to empirical spatial correlations between the two outcomes smaller than 0.25, between 0.25 and 0.75, and larger than 0.75 in absolute value in 107, 330, and 63 of the 500 data sets, respectively.
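
The Pseudocode row quotes a loop that groups tree nodes by even and odd level and, for each node, forms two quantities passed to its parent. The sketch below illustrates only those two steps in Python with NumPy. It is not the spamtree implementation (which is C++), and it assumes the quantities read as m = H^T R^{-1} w and F = H^T R^{-1} H. The function names and the `node_level` dictionary are hypothetical.

```python
import numpy as np

def split_levels_by_parity(node_level):
    """Group node ids by the parity of their tree level.

    node_level: dict mapping node id -> integer level r (hypothetical layout).
    Returns (even_ids, odd_ids), each sorted.
    """
    even = sorted(j for j, r in node_level.items() if r % 2 == 0)
    odd = sorted(j for j, r in node_level.items() if r % 2 == 1)
    return even, odd

def message_to_parent(H, R, w):
    """Compute m = H^T R^{-1} w and F = H^T R^{-1} H.

    Linear solves are used instead of forming R^{-1} explicitly.
    Assumes R is square and nonsingular.
    """
    m = H.T @ np.linalg.solve(R, w)
    F = H.T @ np.linalg.solve(R, H)
    return m, F
```

Nodes within one parity group could then be processed in any order, which is what the quoted "[parallel for]" comment suggests; the sampling step itself (drawing from the Gaussian full conditional) is left out because its mean and covariance depend on equation (17) of the paper, which is not reproduced here.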
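
The Dataset Splits row describes holding out every observation inside a latitude/longitude box and scoring predictions there by RMSE and interval coverage. A minimal sketch of that kind of split follows. It assumes coordinates are signed decimal degrees (west negative) and that the box boundaries are inclusive; neither detail is stated in the quoted text, and all function names are hypothetical.

```python
import numpy as np

def in_holdout_box(lon, lat, lon_range=(-100.0, -97.0), lat_range=(35.0, 38.0)):
    """Boolean mask: True where a point falls inside the held-out box.

    Default bounds follow the quoted area; inclusive edges are an assumption.
    """
    lon = np.asarray(lon, dtype=float)
    lat = np.asarray(lat, dtype=float)
    return (
        (lon >= lon_range[0]) & (lon <= lon_range[1])
        & (lat >= lat_range[0]) & (lat <= lat_range[1])
    )

def rmse(y_true, y_pred):
    """Root mean squared error of point predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def coverage(y_true, lower, upper):
    """Fraction of true values that fall inside [lower, upper]."""
    y_true = np.asarray(y_true, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    return float(np.mean((y_true >= lower) & (y_true <= upper)))
```

With a mask `test = in_holdout_box(lon, lat)`, the training set would be the rows where `~test` holds, and `rmse` and `coverage` would be evaluated on the rows where `test` holds.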
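
The Experiment Setup row lists a regular 70 by 70 grid and a set of fixed and randomly drawn parameters. The sketch below shows how one such configuration might be drawn. It covers only the grid and the parameter draws, not the Gaussian process simulation or the cross-covariance function (18). The unit-square domain, the reading of the first uniform range as (-3, 3), the dictionary keys, and the function names are assumptions.

```python
import numpy as np

def make_grid(side=70):
    """Regular side x side grid; returns an array of shape (side**2, 2).

    The unit square is an assumed domain, not taken from the source.
    """
    axis = np.linspace(0.0, 1.0, side)
    xx, yy = np.meshgrid(axis, axis)
    return np.column_stack([xx.ravel(), yy.ravel()])

def draw_parameters(rng):
    """One parameter configuration for the two outcomes j = 1, 2."""
    return {
        "sigma_j1": rng.uniform(-3.0, 3.0, size=2),   # sampled per outcome
        "phi_j": rng.uniform(0.1, 3.0, size=2),       # sampled per outcome
        "phi": rng.uniform(0.1, 30.0),                # shared
        "delta_12": rng.exponential(1.0),             # Exp(1): mean 1
        "sigma_j2": np.ones(2),                       # fixed at 1
        "alpha": 1.0,                                 # fixed
        "beta": 1.0,                                  # fixed
        "tau_sq": np.array([0.01, 0.1]),              # nuggets, fixed
    }

# 500 configurations, matching the number of data sets in the quoted setup.
rng = np.random.default_rng(2022)  # arbitrary seed
configs = [draw_parameters(rng) for _ in range(500)]
```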