Large-scale Distributed Dependent Nonparametric Trees

Authors: Zhiting Hu, Qirong Ho, Avinava Dubey, Eric P. Xing

ICML 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 5 (Experimental Results): Our experiments show that (1) our distributed framework achieves near-linear (i.e., near-optimal) scalability with an increasing number of cores/machines; (2) the DNTs system enables big tree models (10K nodes) on large data and captures long-tail topics well; (3) the proposed VI algorithm achieves heldout likelihood competitive with MCMC and discovers meaningful topic evolution.
Researcher Affiliation | Academia | Zhiting Hu (ZHITINGH@CS.CMU.EDU), Language Technologies Institute, Carnegie Mellon University; Qirong Ho (HOQIRONG@GMAIL.COM), Institute for Infocomm Research, A*STAR; Avinava Dubey (AKDUBEY@CS.CMU.EDU) and Eric P. Xing (EPXING@CS.CMU.EDU), Machine Learning Department, Carnegie Mellon University.
Pseudocode | Yes | Algorithm 1: Distributed training for DNTs (a hedged parameter-server training sketch follows the table).
Open Source Code | No | The paper mentions building on the 'Petuum parameter server (Ho et al., 2013; Dai et al., 2015) from petuum.org' but provides no statement or link indicating that the authors' DNTs implementation, or any code for the described methodology, has been open-sourced.
Open Datasets | Yes | We use three public corpora for the evaluation. PubMed: 8,400,000 PubMed abstracts; the vocabulary is pruned to 70,000 words; since no time stamp is associated, we treat the whole dataset as one epoch. PNAS: 79,800 paper titles from the Proceedings of the National Academy of Sciences, 1915-2005; the vocabulary size is 36,901; we grouped the titles into 10 ten-year epochs. NIPS: 1,740 documents from the Proceedings of NIPS, 1988-1999; the vocabulary size is 13,649; we grouped the documents into 12 one-year epochs.
Dataset Splits | No | The paper states that 'the y-axis represents the per-document heldout likelihood (on 10% heldout test set)', indicating a test split, but it does not mention a separate validation set or its size (an illustrative epoch-grouping and 90/10 split sketch follows the table).
Hardware Specification | Yes | All experiments were run on a compute cluster where each machine has 16 cores and 128 GB RAM, connected via 1 Gbps Ethernet.
Software Dependencies | No | The paper mentions 'We use the Petuum parameter server (Ho et al., 2013; Dai et al., 2015) from petuum.org' but gives no version numbers for Petuum or any other software dependency, so the software environment is not reproducible.
Experiment Setup | Yes | For all the experiments, we set κ0 = 100, κ1 = κ2 = 1, and used stick-breaking parameters α = γ = 0.5. We estimate each node's vMF concentration parameter β from the data according to (Banerjee et al., 2005). The staleness s for bounded-asynchronous data parallelism is set to 0, which means workers always get up-to-date global parameters from the PS (a vMF-concentration estimation sketch follows the table).
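
The Pseudocode and Software Dependencies rows above point to Algorithm 1 (distributed training for DNTs) running on the Petuum parameter server with staleness s = 0. The paper's algorithm is not reproduced here; the sketch below only illustrates the general data-parallel parameter-server pattern those rows describe, using a hypothetical ParameterServer class and a placeholder local_update function, neither of which is Petuum's actual API.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

class ParameterServer:
    """Toy stand-in for a parameter server (not Petuum's API): workers pull the
    global parameters and push additive deltas; with staleness 0, every pull
    reflects all previously pushed updates."""
    def __init__(self, init_params):
        self._params = dict(init_params)
        self._lock = threading.Lock()

    def pull(self):
        with self._lock:
            return dict(self._params)

    def push(self, delta):
        with self._lock:
            for key, value in delta.items():
                self._params[key] = self._params.get(key, 0.0) + value

def local_update(params, shard):
    """Placeholder for a worker's local variational update on its data shard."""
    return {key: 0.01 * len(shard) for key in params}  # illustrative delta only

def train(ps, shards, n_iters=5):
    """Data-parallel loop: each worker repeatedly pulls fresh parameters
    (staleness 0) and pushes its local update."""
    def worker(shard):
        for _ in range(n_iters):
            params = ps.pull()
            ps.push(local_update(params, shard))
    with ThreadPoolExecutor(max_workers=len(shards)) as executor:
        list(executor.map(worker, shards))
    return ps.pull()

# Usage (hypothetical data): two workers, each with its own document shard.
# ps = ParameterServer({"node_0_stat": 0.0, "node_1_stat": 0.0})
# final_params = train(ps, shards=[["doc1", "doc2"], ["doc3"]])
```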
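
The Open Datasets and Dataset Splits rows above describe grouping the PNAS titles into 10 ten-year epochs (1915-2005), the NIPS documents into 12 one-year epochs (1988-1999), and evaluating heldout likelihood on a 10% test set. Below is a minimal illustrative sketch of that preprocessing, assuming each document carries a year field; the helper names are hypothetical, and the authors' exact epoch boundaries and split procedure are not specified beyond the figures quoted above.

```python
import random
from collections import defaultdict

def group_into_epochs(docs, start_year, epoch_len):
    """Group (year, text) pairs into consecutive epochs of `epoch_len` years."""
    epochs = defaultdict(list)
    for year, text in docs:
        epochs[(year - start_year) // epoch_len].append(text)
    return [epochs[e] for e in sorted(epochs)]

def heldout_split(docs, test_frac=0.10, seed=0):
    """Randomly hold out `test_frac` of the documents for testing (paper: 10%)."""
    rng = random.Random(seed)
    docs = list(docs)
    rng.shuffle(docs)
    n_test = int(len(docs) * test_frac)
    return docs[n_test:], docs[:n_test]  # (train, test)

# Example (hypothetical data):
# pnas_epochs = group_into_epochs(pnas_docs, start_year=1915, epoch_len=10)  # 10 epochs
# nips_epochs = group_into_epochs(nips_docs, start_year=1988, epoch_len=1)   # 12 epochs
# train_docs, test_docs = heldout_split(all_docs, test_frac=0.10)
```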
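
The Experiment Setup row above fixes κ0 = 100, κ1 = κ2 = 1, stick-breaking parameters α = γ = 0.5, staleness s = 0, and estimates each node's vMF concentration from data via Banerjee et al. (2005). A minimal sketch follows, assuming unit-normalized word/document vectors; the CONFIG dict and function name are illustrative, and the estimator is the standard closed-form approximation κ ≈ (r·d − r³)/(1 − r²) with r = ‖Σ_i x_i‖ / n.

```python
import numpy as np

# Hyperparameters reported in the Experiment Setup row (dict keys are illustrative).
CONFIG = {
    "kappa0": 100.0,   # κ0
    "kappa1": 1.0,     # κ1
    "kappa2": 1.0,     # κ2
    "alpha": 0.5,      # stick-breaking parameter α
    "gamma": 0.5,      # stick-breaking parameter γ
    "staleness": 0,    # bounded-asynchronous staleness (0 = always fresh parameters)
}

def estimate_vmf_concentration(X: np.ndarray) -> float:
    """Approximate the vMF concentration for unit vectors X of shape (n, d)
    using the closed-form estimator of Banerjee et al. (2005):
        r = ||sum_i x_i|| / n,   kappa ≈ (r * d - r**3) / (1 - r**2)."""
    n, d = X.shape
    r = np.linalg.norm(X.sum(axis=0)) / n
    return (r * d - r ** 3) / (1.0 - r ** 2)

# Example with synthetic unit vectors (illustrative only):
# rng = np.random.default_rng(0)
# X = rng.normal(size=(1000, 50))
# X /= np.linalg.norm(X, axis=1, keepdims=True)
# beta_hat = estimate_vmf_concentration(X)
```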