Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Escaping The Curse of Dimensionality in Bayesian Model-Based Clustering
Authors: Noirrit Kiran Chandra, Antonio Canale, David B. Dunson
JMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed approach is shown to have good performance in simulation studies and an application to inferring cell types based on scRNA-seq data. Keywords: Big data; Clustering; Dirichlet process; Exchangeable partition probability function; High dimensional; Latent variables; Mixture model. |
| Researcher Affiliation | Academia | Noirrit Kiran Chandra, Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX, USA; Antonio Canale, Department of Statistical Sciences, University of Padova, Padova, Italy; David B. Dunson, Department of Statistical Science, Duke University, Durham, NC, USA |
| Pseudocode | Yes | For posterior computation we use a Gibbs sampler defined by the following steps. **Step 1** Letting $\lambda_j^T$ denote the $j$th row of $\Lambda$, $\eta = [\eta_1, \ldots, \eta_n]^T$, $D_j = \tau^2 \mathrm{diag}(\psi_{j1}\varphi_{j1}^2, \ldots, \psi_{jd}\varphi_{jd}^2)$ and $y^{(j)} = (y_{1j}, \ldots, y_{nj})^T$, for $j = 1, \ldots, p$ sample $(\lambda_j \mid -) \sim N_d\{(D_j^{-1} + \sigma_j^{-2}\eta^T\eta)^{-1}\eta^T\sigma_j^{-2}y^{(j)},\; (D_j^{-1} + \sigma_j^{-2}\eta^T\eta)^{-1}\}$. **Step 2** Update the $\Delta_h$'s from the inverse-Wishart distributions $\mathrm{IW}(\hat\psi_h, \hat\nu_h)$, where $\bar\eta_h = \tfrac{1}{n_h}\sum_{i:c_i=h}\eta_i$, $\hat\nu_h = \nu_0 + n_h$, and $\hat\psi_h = \xi^2 I_d + \sum_{i:c_i=h}(\eta_i - \bar\eta_h)(\eta_i - \bar\eta_h)^T + \tfrac{\kappa_0 n_h}{\kappa_0 + n_h}\bar\eta_h\bar\eta_h^T$. Due to conjugacy, the location parameters $\mu_h$'s can be integrated out of the model. **Step 3** Sample the latent factors, for $i = 1, \ldots, n$, from $(\eta_i \mid -) \sim N_d\{\Omega_h\rho_h,\; \Omega_h + \Omega_h(\hat\kappa_{h,-i}\Delta_h)^{-1}\Omega_h\}$, where $n_{h,-i} = \sum_{j\neq i}1(c_j = h)$, $\hat\kappa_{h,-i} = \kappa_0 + n_{h,-i}$, $\bar\eta_{h,-i} = \tfrac{1}{n_{h,-i}}\sum_{j:c_j=h,\, j\neq i}\eta_j$, $\hat\mu_{h,-i} = \tfrac{n_{h,-i}\bar\eta_{h,-i}}{n_{h,-i}+\kappa_0}$, $\rho_h = \Lambda^T\Sigma^{-1}y_i + \Delta_h^{-1}\hat\mu_{h,-i}$ and $\Omega_h^{-1} = \Lambda^T\Sigma^{-1}\Lambda + \Delta_h^{-1}$. **Step 4** Sample the cluster indicator variables $c_1, \ldots, c_n$ with probabilities $\Pi(c_i = h \mid -) \propto n_{h,-i}\int N_d(\eta_i; \mu_h, \Delta_h)\, d\Pi(\mu_h, \Delta_h \mid c_{-i}, \eta_{-i})$ for $h \in c_{-i}$, and $\propto \alpha\int N_d(\eta_i; \mu_h, \Delta_h)\, d\Pi(\mu_h, \Delta_h)$ for $h \notin c_{-i}$ (9), where $\eta_{-i} = \{\eta_j : j \neq i\}$ and $c_{-i} = \{c_j : j \neq i\}$. Due to conjugacy the above integrals are analytically available. **Step 5** Let $r$ be the number of unique $c_i$'s. Following West (1992), first generate $\varphi \sim \mathrm{Beta}(\alpha + 1, n)$, evaluate $\pi/(1-\pi) = (a_\alpha + r - 1)/\{n(b_\alpha - \log\varphi)\}$ and generate $\alpha \sim \mathrm{Ga}(a_\alpha + r, b_\alpha - \log\varphi)$ with probability $\pi$, or $\alpha \sim \mathrm{Ga}(a_\alpha + r - 1, b_\alpha - \log\varphi)$ with probability $1-\pi$. **Step 6** For $j = 1, \ldots, p$ sample $\sigma_j^{-2}$ from $\mathrm{Ga}\{a_\sigma + n/2,\; b_\sigma + \sum_{i=1}^n(y_{ij} - \lambda_j^T\eta_i)^2/2\}$. **Step 7** Update the hyper-parameters of the Dirichlet-Laplace prior through: (i) for $j = 1, \ldots, p$ and $h = 1, \ldots, d$ sample $\tilde\psi_{jh}$ independently from an inverse-Gaussian $\mathrm{iG}(\tau\varphi_{jh}/\lvert\lambda_{jh}\rvert, 1)$ distribution and set $\psi_{jh} = 1/\tilde\psi_{jh}$; (ii) sample the full conditional posterior distribution of $\tau$ from a generalized inverse Gaussian $\mathrm{giG}\{dp(1-a), 1, 2\sum_{j,h}\lvert\lambda_{jh}\rvert/\varphi_{jh}\}$ distribution; (iii) to sample $\varphi \mid \Lambda$, draw $T_{jh}$ independently with $T_{jh} \sim \mathrm{giG}(a - 1, 1, 2\lvert\lambda_{jh}\rvert)$ and set $\varphi_{jh} = T_{jh}/T$ with $T = \sum_{j,h}T_{jh}$. |
| Open Source Code | Yes | The sampler introduced in Section 3.2 is available from the GitHub page of the first author. |
| Open Datasets | Yes | In this section, we analyze the GSE81861 cell line dataset (Li et al., 2017) to illustrate the proposed method. |
| Dataset Splits | No | The paper describes a dataset with N=531 cells and P=7,666 genes, but does not provide specific train/test/validation splits. It mentions preprocessing steps but not data partitioning for experimental reproduction. |
| Hardware Specification | Yes | On average, 6,000 iterations under these settings took between 40 and 50 minutes on an iMac with a 4.2 GHz Quad-Core Intel Core i7 processor and 32 GB of DDR4 RAM. |
| Software Dependencies | No | The paper mentions R packages BNPmix (Corradin et al., 2021) and IMIFA (Murphy et al., 2019), but does not specify their exact version numbers. |
| Experiment Setup | Yes | For all the simulation experiments of the next section and the application, we choose $\mu_0 = 0$ and a prior scale matrix $\xi^2 I_d$ for a scalar $\xi^2 > 0$. To specify weakly informative priors, we set $\xi^2 = 20$, $\kappa_0 = 0.001$, $\nu_0 = \hat{d} + 50$, and $a_\alpha = b_\alpha = 0.1$ as the hyper-parameters of the DP mixture prior; $a_\sigma = 1$, $b_\sigma = 0.3$ as the hyper-parameters of the prior on the residual variances. We set $a = 0.5$ as the Dirichlet-Laplace parameter following the recommendation of Bhattacharya et al. (2015). [...] We run our sampler for 6,000 iterations, discarding the first 1,000 as burn-in and taking one draw every five to reduce autocorrelation. Prior elicitation follows the default specification of Section 3.1. [...] We implement Lamb using our default prior, collecting 10,000 iterations after a burn-in of 5,000 and keeping one draw in five. |
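The Step 5 update of the DP concentration parameter quoted in the Pseudocode row follows West's auxiliary-variable scheme. A minimal NumPy sketch of that single step (function name, defaults, and seed are illustrative, not from the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def update_alpha(alpha, n, r, a_alpha=0.1, b_alpha=0.1, rng=rng):
    """One auxiliary-variable update for the DP concentration alpha (Step 5),
    given n observations and r occupied clusters; West (1992) style."""
    # Auxiliary variable: phi ~ Beta(alpha + 1, n)
    phi = rng.beta(alpha + 1, n)
    # Mixing weight: pi / (1 - pi) = (a_alpha + r - 1) / {n (b_alpha - log phi)}
    odds = (a_alpha + r - 1) / (n * (b_alpha - np.log(phi)))
    pi = odds / (1.0 + odds)
    rate = b_alpha - np.log(phi)
    # Draw from the two-component gamma mixture (numpy gamma uses a scale
    # parameter, so pass 1/rate).
    if rng.random() < pi:
        return rng.gamma(a_alpha + r, 1.0 / rate)
    return rng.gamma(a_alpha + r - 1, 1.0 / rate)
```

The hyper-parameter defaults `a_alpha = b_alpha = 0.1` mirror the values reported in the Experiment Setup row.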
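The reported MCMC settings (6,000 iterations, first 1,000 discarded, one draw kept in five) imply 1,000 retained posterior draws. A small bookkeeping sketch under one common convention (0-based iteration indices, keeping the last draw of each post-burn-in block of five); the function name is illustrative:

```python
def kept_draws(n_iter=6000, burn_in=1000, thin=5):
    """Indices of MCMC iterations retained after burn-in and thinning.

    With the defaults above this keeps iterations 1004, 1009, ..., 5999,
    i.e. 1,000 draws out of 6,000.
    """
    return list(range(burn_in + thin - 1, n_iter, thin))

draws = kept_draws()
```

The same helper with `n_iter=15000, burn_in=5000` reproduces the 10,000-iteration run mentioned for Lamb (2,000 retained draws under this convention).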