Scalable Estimation of Dirichlet Process Mixture Models on Distributed Data
Authors: Ruohui Wang, Dahua Lin
IJCAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on large real-world data sets show that the proposed method can achieve high scalability in distributed and asynchronous environments without compromising the mixing performance. |
| Researcher Affiliation | Academia | Ruohui Wang, Department of Information Engineering, The Chinese University of Hong Kong (wr013@ie.cuhk.edu.hk); Dahua Lin, Department of Information Engineering, The Chinese University of Hong Kong (dhlin@ie.cuhk.edu.hk) |
| Pseudocode | Yes | Algorithm 1 (Progressive Consolidation) and Algorithm 2 (Restricted Consolidation) |
| Open Source Code | No | The paper does not provide any statement about releasing source code, nor does it include links to a code repository. |
| Open Datasets | Yes | The ImageNet dataset is constructed from the training set of ILSVRC [Russakovsky et al., 2015]... For the New York Times (NYT) Corpus [Sandhaus, 2008]... |
| Dataset Splits | No | The paper does not provide specific details on training, validation, and test splits (e.g., percentages or counts). It mentions using the 'training set' for ImageNet and 'provided groundtruths' for evaluation but lacks explicit split information. |
| Hardware Specification | No | We conducted the experiments using up to 30 workers on multiple physical servers. They can communicate with each other via Gigabit Ethernet or TCP loop-back interfaces. |
| Software Dependencies | No | The paper describes various algorithms and methods but does not provide specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, scikit-learn X.Y). |
| Experiment Setup | Yes | We formulate a Gaussian mixture to describe the feature samples, where the covariance of each Gaussian component is fixed to σ²I with σ = 8. We use N(0, σ₀²I) as the prior distribution over the mean parameters of these components, where σ₀ = 8. For the New York Times (NYT) Corpus [Sandhaus, 2008], we construct a vocabulary with 9866 distinct words and derive a bag-of-words representation for each article. Removing those with fewer than 20 words, we obtain a data set with about 1.7M articles. We use a mixture of multinomial distributions to describe the NYT corpus. The prior here is a symmetric Dirichlet distribution with hyperparameter γ = 1. |
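The experiment setup quoted above fixes conjugate priors for both data sets: a spherical Gaussian likelihood with a zero-mean Gaussian prior on the component means for the ImageNet features, and a multinomial likelihood with a symmetric Dirichlet prior for the NYT bag-of-words counts. Below is a minimal Python sketch of the two conjugate posterior updates under the stated hyperparameters (σ = σ₀ = 8, γ = 1, vocabulary size 9866); the function names, the 128-dimensional toy features, and the random data are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Hedged sketch of the two mixture setups described in the paper's experiment section.
# Hyperparameters (sigma, sigma0, gamma, vocab_size) come from the paper; everything
# else (function names, feature dimension, toy data) is an assumption for illustration.

def gaussian_mean_posterior(x_sum, n, sigma=8.0, sigma0=8.0):
    """Posterior N(mu_post, tau2_post * I) over a component mean, given n assigned
    observations with coordinate-wise sum x_sum, likelihood N(mu, sigma^2 I), and
    prior N(0, sigma0^2 I)."""
    prec = 1.0 / sigma0**2 + n / sigma**2      # per-dimension posterior precision
    tau2_post = 1.0 / prec
    mu_post = tau2_post * (x_sum / sigma**2)   # prior mean is 0, so only the data term remains
    return mu_post, tau2_post

def multinomial_posterior(word_counts, gamma=1.0):
    """Dirichlet posterior parameters for one component's word distribution, given the
    summed bag-of-words counts of the articles assigned to it and a symmetric
    Dirichlet(gamma) prior."""
    return word_counts + gamma                 # Dirichlet(gamma + counts)

# Toy usage with random data (dimensions are assumptions, not from the paper).
rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 128))            # 100 image feature vectors, 128-dim
mu, tau2 = gaussian_mean_posterior(feats.sum(axis=0), n=len(feats))

vocab_size = 9866                              # vocabulary size reported in the paper
counts = rng.integers(0, 5, size=vocab_size)   # toy word counts for one component
alpha_post = multinomial_posterior(counts)
```

Because both likelihood/prior pairs are conjugate, each component's posterior is summarized by simple sufficient statistics (per-component feature sums and word counts), which is presumably what makes consolidating sub-models computed by distributed workers tractable in the authors' setting.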