Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Escaping The Curse of Dimensionality in Bayesian Model-Based Clustering
Authors: Noirrit Kiran Chandra, Antonio Canale, David B. Dunson
JMLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed approach is shown to have good performance in simulation studies and an application to inferring cell types based on scRNA-seq data. Keywords: Big data; Clustering; Dirichlet process; Exchangeable partition probability function; High dimensional; Latent variables; Mixture model. |
| Researcher Affiliation | Academia | Noirrit Kiran Chandra, Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX, USA; Antonio Canale, Department of Statistical Sciences, University of Padova, Padova, Italy; David B. Dunson, Department of Statistical Science, Duke University, Durham, NC, USA |
| Pseudocode | Yes | For posterior computation we use a Gibbs sampler defined by the following steps. **Step 1** Letting $\lambda_j^T$ denote the $j$th row of $\Lambda$, $\eta = [\eta_1, \ldots, \eta_n]^T$, $D_j = \tau^2 \mathrm{diag}(\psi_{j1}\varphi_{j1}^2, \ldots, \psi_{jd}\varphi_{jd}^2)$ and $y^{(j)} = (y_{1j}, \ldots, y_{nj})^T$, for $j = 1, \ldots, p$ sample $(\lambda_j \mid -) \sim N_d\{(D_j^{-1} + \sigma_j^{-2}\eta^T\eta)^{-1}\eta^T\sigma_j^{-2}y^{(j)},\; (D_j^{-1} + \sigma_j^{-2}\eta^T\eta)^{-1}\}$. **Step 2** Update the $\Delta_h$'s from the inverse-Wishart distributions $\mathrm{IW}(\hat\psi_h, \hat\nu_h)$, where $\bar\eta_h = \tfrac{1}{n_h}\sum_{i:c_i=h}\eta_i$, $\hat\nu_h = \nu_0 + n_h$, and $\hat\psi_h = \xi^2 I_d + \sum_{i:c_i=h}(\eta_i - \bar\eta_h)(\eta_i - \bar\eta_h)^T + \tfrac{\kappa_0 n_h}{\kappa_0 + n_h}\bar\eta_h\bar\eta_h^T$. Due to conjugacy, the location parameters $\mu_h$'s can be integrated out of the model. **Step 3** Sample the latent factors, for $i = 1, \ldots, n$, from $(\eta_i \mid -) \sim N_d\{\Omega_h\rho_h,\; \Omega_h + \Omega_h(\hat\kappa_{h,-i}\Delta_h)^{-1}\Omega_h\}$, where $n_{h,-i} = \sum_{j\neq i}1(c_j = h)$, $\hat\kappa_{h,-i} = \kappa_0 + n_{h,-i}$, $\bar\eta_{h,-i} = \tfrac{1}{n_{h,-i}}\sum_{j:c_j=h,\, j\neq i}\eta_j$, $\hat\mu_{h,-i} = \tfrac{n_{h,-i}\bar\eta_{h,-i}}{n_{h,-i}+\kappa_0}$, $\rho_h = \Lambda^T\Sigma^{-1}y_i + \Delta_h^{-1}\hat\mu_{h,-i}$ and $\Omega_h^{-1} = \Lambda^T\Sigma^{-1}\Lambda + \Delta_h^{-1}$. **Step 4** Sample the cluster indicator variables $c_1, \ldots, c_n$ with probabilities $\Pi(c_i = h \mid -) \propto n_{h,-i}\int N_d(\eta_i; \mu_h, \Delta_h)\, d\Pi(\mu_h, \Delta_h \mid c_{-i}, \eta_{-i})$ for $h \in c_{-i}$, and $\propto \alpha\int N_d(\eta_i; \mu_h, \Delta_h)\, d\Pi(\mu_h, \Delta_h)$ for $h \notin c_{-i}$ (9), where $\eta_{-i} = \{\eta_j : j \neq i\}$ and $c_{-i} = \{c_j : j \neq i\}$. Due to conjugacy the above integrals are analytically available. **Step 5** Let $r$ be the number of unique $c_i$'s. Following West (1992), first generate $\varphi \sim \mathrm{Beta}(\alpha + 1, n)$, evaluate $\pi/(1-\pi) = (a_\alpha + r - 1)/\{n(b_\alpha - \log\varphi)\}$ and generate $\alpha \sim \mathrm{Ga}(a_\alpha + r, b_\alpha - \log\varphi)$ with probability $\pi$, or $\alpha \sim \mathrm{Ga}(a_\alpha + r - 1, b_\alpha - \log\varphi)$ with probability $1-\pi$. **Step 6** For $j = 1, \ldots, p$ sample $\sigma_j^{-2}$ from $\mathrm{Ga}\{a_\sigma + n/2,\; b_\sigma + \sum_{i=1}^n(y_{ij} - \lambda_j^T\eta_i)^2/2\}$. **Step 7** Update the hyper-parameters of the Dirichlet-Laplace prior through: (i) for $j = 1, \ldots, p$ and $h = 1, \ldots, d$ sample $\tilde\psi_{jh}$ independently from an inverse-Gaussian $\mathrm{iG}(\tau\varphi_{jh}/\lvert\lambda_{jh}\rvert, 1)$ distribution and set $\psi_{jh} = 1/\tilde\psi_{jh}$; (ii) sample the full conditional posterior distribution of $\tau$ from a generalized inverse Gaussian $\mathrm{giG}\{dp(1-a), 1, 2\sum_{j,h}\lvert\lambda_{jh}\rvert/\varphi_{jh}\}$ distribution; (iii) to sample $\varphi \mid \Lambda$, draw $T_{jh}$ independently with $T_{jh} \sim \mathrm{giG}(a - 1, 1, 2\lvert\lambda_{jh}\rvert)$ and set $\varphi_{jh} = T_{jh}/T$ with $T = \sum_{j,h}T_{jh}$. |
| Open Source Code | Yes | The sampler introduced in Section 3.2 is available from the GitHub page of the first author. |
| Open Datasets | Yes | In this section, we analyze the GSE81861 cell line dataset (Li et al., 2017) to illustrate the proposed method. |
| Dataset Splits | No | The paper describes a dataset with N=531 cells and P=7,666 genes, but does not provide specific train/test/validation splits. It mentions preprocessing steps but not data partitioning for experimental reproduction. |
| Hardware Specification | Yes | On average, 6,000 iterations under these settings took between 40 and 50 minutes on an iMac with a 4.2 GHz Quad-Core Intel Core i7 processor and 32 GB of DDR4 RAM. |
| Software Dependencies | No | The paper mentions R packages BNPmix (Corradin et al., 2021) and IMIFA (Murphy et al., 2019), but does not specify their exact version numbers. |
| Experiment Setup | Yes | For all the simulation experiments of the next section and the application, we choose $\mu_0 = 0$ and a prior scale matrix $\xi^2 I_d$ for a scalar $\xi^2 > 0$. To specify weakly informative priors, we set $\xi^2 = 20$, $\kappa_0 = 0.001$, $\nu_0 = \hat{d} + 50$, and $a_\alpha = b_\alpha = 0.1$ as the hyper-parameters of the DP mixture prior; $a_\sigma = 1$, $b_\sigma = 0.3$ as the hyper-parameters of the prior on the residual variances. We set $a = 0.5$ as the Dirichlet-Laplace parameter following the recommendation of Bhattacharya et al. (2015). [...] We run our sampler for 6,000 iterations, discarding the first 1,000 as burn-in and taking one draw every five to reduce autocorrelation. Prior elicitation follows the default specification of Section 3.1. [...] We implement Lamb using our default prior, collecting 10,000 iterations after a burn-in of 5,000 and keeping one draw in five. |
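The Step 5 update of the DP concentration parameter quoted in the Pseudocode row follows West's auxiliary-variable scheme. A minimal NumPy sketch of that single step (function name, defaults, and seed are illustrative, not from the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def update_alpha(alpha, n, r, a_alpha=0.1, b_alpha=0.1, rng=rng):
    """One auxiliary-variable update for the DP concentration alpha (Step 5),
    given n observations and r occupied clusters; West (1992) style."""
    # Auxiliary variable: phi ~ Beta(alpha + 1, n)
    phi = rng.beta(alpha + 1, n)
    # Mixing weight: pi / (1 - pi) = (a_alpha + r - 1) / {n (b_alpha - log phi)}
    odds = (a_alpha + r - 1) / (n * (b_alpha - np.log(phi)))
    pi = odds / (1.0 + odds)
    rate = b_alpha - np.log(phi)
    # Draw from the two-component gamma mixture (numpy gamma uses a scale
    # parameter, so pass 1/rate).
    if rng.random() < pi:
        return rng.gamma(a_alpha + r, 1.0 / rate)
    return rng.gamma(a_alpha + r - 1, 1.0 / rate)
```

The hyper-parameter defaults `a_alpha = b_alpha = 0.1` mirror the values reported in the Experiment Setup row.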
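The reported MCMC settings (6,000 iterations, first 1,000 discarded, one draw kept in five) imply 1,000 retained posterior draws. A small bookkeeping sketch under one common convention (0-based iteration indices, keeping the last draw of each post-burn-in block of five); the function name is illustrative:

```python
def kept_draws(n_iter=6000, burn_in=1000, thin=5):
    """Indices of MCMC iterations retained after burn-in and thinning.

    With the defaults above this keeps iterations 1004, 1009, ..., 5999,
    i.e. 1,000 draws out of 6,000.
    """
    return list(range(burn_in + thin - 1, n_iter, thin))

draws = kept_draws()
```

The same helper with `n_iter=15000, burn_in=5000` reproduces the 10,000-iteration run mentioned for Lamb (2,000 retained draws under this convention).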