Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Scaling up Data Augmentation MCMC via Calibration

Authors: Leo L. Duan, James E. Johndrow, David B. Dunson

JMLR 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Dramatic gains in computational efficiency are shown in applications. Keywords: Bayesian Probit, Biased subsampling, Big n, Data augmentation, Log-linear model, Logistic regression, Maximal correlation, Polya-Gamma
Researcher Affiliation | Academia | Leo L. Duan (EMAIL), Department of Statistics, University of Florida, Gainesville, FL; James E. Johndrow (EMAIL), Department of Statistics, Stanford University, Stanford, CA; David B. Dunson (EMAIL), Department of Statistical Science, Duke University, Durham, NC
Pseudocode | No | The paper describes specific algorithms and update rules using mathematical notation, but it does not include any clearly labeled pseudocode blocks or algorithm boxes.
Open Source Code | No | The paper mentions using 'Tensor Flow' for automatic differentiation and optimization, the 'CODA package in R' for calculating effective sample size, and 'STAN 2.17' for Hamiltonian Monte Carlo, but it does not provide source code for the methodology developed in the paper itself.
Open Datasets | Yes | The dataset is a large sparse network from the Human Connectome Project (Marcus et al., 2011).
Dataset Splits | Yes | We use another co-browsing count table for the same high-traffic and client sites, collected during a different time period. ... Cross-validation root-mean-squared error {Σ_i (ŷ_i − y_i)² / n}^{1/2} between the predicted and actual counts y_i is computed.
Hardware Specification | No | The paper provides no specific details about the hardware used for running the experiments, only general statements about 'computing time'.
Software Dependencies | Yes | We run DA for 30,000 steps and CDA for 2,000 steps, so that they have approximately the same effective sample size (calculated with the CODA package in R). ... We ran the ordinary DA algorithm with λ = 1,000, CDA with λ = 10^9, and Hamiltonian Monte Carlo with the No-U-Turn sampler under the default tuning settings (as implemented in STAN 2.17).
Experiment Setup | Yes | For illustration, we consider a simulation study for probit regression with an intercept and two predictors x_{i,1}, x_{i,2} ~ No(1, 1), with θ = (−5, 1, −1)′, generating Σ_i y_i ≈ 20 among n = 10,000. For illustration, we use a two-parameter intercept-slope model with x_{i1} iid ~ No(0, 1) and θ = (−8, 1)′. With n = 10^5, we obtain rare-outcome data with Σ_i y_i ≈ 50. We run DA for 30,000 steps and CDA for 2,000 steps, so that they have approximately the same effective sample size (calculated with the CODA package in R). Both algorithms are initialized at the MAP estimates. We ran the ordinary DA algorithm with λ = 1,000, CDA with λ = 10^9, and Hamiltonian Monte Carlo with the No-U-Turn sampler under the default tuning settings (as implemented in STAN 2.17). All algorithms are initialized at the MAP. We ran DA for 200,000 steps, CDA for 2,000 steps, and HMC for 20,000 steps so that they have approximately the same effective sample size. For CDA, we used the first 1,000 steps for adapting r and b.
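The cross-validation error quoted under Dataset Splits is the root-mean-squared error {Σ_i (ŷ_i − y_i)² / n}^{1/2} between predicted and held-out counts. A minimal sketch of that formula (the `rmse` name and the toy values are hypothetical, not from the paper):

```python
import math

def rmse(y_hat, y):
    """Root-mean-squared error {sum_i (y_hat_i - y_i)^2 / n}^{1/2}."""
    n = len(y)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(y_hat, y)) / n)

# Toy held-out counts, for illustration only.
predicted = [3.0, 5.0, 8.0, 2.0]
actual = [2.0, 6.0, 7.0, 2.0]
print(rmse(predicted, actual))  # sqrt((1 + 1 + 1 + 0) / 4) = sqrt(0.75)
```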
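The chain lengths above (30,000 DA steps vs. 2,000 CDA steps) are matched by effective sample size, ESS = n / (1 + 2 Σ_k ρ_k), where ρ_k is the lag-k autocorrelation. The sketch below uses a simple truncate-at-first-non-positive-lag rule; the paper's CODA package uses a spectral-density estimator that differs in detail, and all names here are illustrative:

```python
import random

def ess(chain):
    """Effective sample size n / (1 + 2 * sum_k rho_k), truncating the
    autocorrelation sum at the first non-positive lag (a simple rule of
    thumb, not CODA's spectral estimator)."""
    n = len(chain)
    mean = sum(chain) / n
    var = sum((x - mean) ** 2 for x in chain) / n
    if var == 0.0:
        return float(n)
    acsum = 0.0
    for k in range(1, n // 2):
        rho = sum((chain[i] - mean) * (chain[i + k] - mean)
                  for i in range(n - k)) / (n * var)
        if rho <= 0.0:
            break
        acsum += rho
    return n / (1.0 + 2.0 * acsum)

rng = random.Random(1)
iid = [rng.gauss(0.0, 1.0) for _ in range(500)]  # independent draws
ar = [0.0]
for _ in range(499):  # strongly autocorrelated AR(1) chain
    ar.append(0.9 * ar[-1] + rng.gauss(0.0, 1.0))
print(ess(iid), ess(ar))  # the autocorrelated chain yields far fewer effective samples
```

A slowly mixing sampler like plain DA on imbalanced data needs many more iterations than a well-calibrated one to reach the same ESS, which is why the quoted step counts differ by an order of magnitude.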
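The probit simulation quoted under Experiment Setup can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the signs in θ are partly lost in the extracted excerpt, so θ = (−5, 1, −1)′ is an assumption chosen to be consistent with roughly 20 successes among n = 10,000, and `simulate_probit` and `phi` are hypothetical names.

```python
import math
import random

def phi(z):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_probit(n=10_000, theta=(-5.0, 1.0, -1.0), seed=0):
    """Draw (X, y) under the probit model P(y_i = 1) = Phi(x_i' theta),
    with an intercept and predictors x_{i,1}, x_{i,2} ~ No(1, 1)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        x1, x2 = rng.gauss(1.0, 1.0), rng.gauss(1.0, 1.0)
        eta = theta[0] + theta[1] * x1 + theta[2] * x2
        X.append((1.0, x1, x2))
        y.append(1 if rng.random() < phi(eta) else 0)
    return X, y

X, y = simulate_probit()
print(sum(y))  # a rare-outcome regime: only a handful of successes in 10,000 draws
```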