Large-scale Distributed Dependent Nonparametric Trees
Authors: Zhiting Hu, Qirong Ho, Avinava Dubey, Eric P. Xing
ICML 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5. Experimental Results: Our experiments show that (1) our distributed framework achieves near-linear (i.e. near-optimal) scalability with increasing number of cores/machines; (2) the DNTs system enables big tree models (10K nodes) on large data, and well captures long-tail topics; and (3) the proposed VI algorithm achieves competitive heldout likelihood with MCMC, and discovers meaningful topic evolution. |
| Researcher Affiliation | Academia | Zhiting Hu (ZHITINGH@CS.CMU.EDU), Language Technologies Institute, Carnegie Mellon University; Qirong Ho (HOQIRONG@GMAIL.COM), Institute for Infocomm Research, A*STAR; Avinava Dubey (AKDUBEY@CS.CMU.EDU) and Eric P. Xing (EPXING@CS.CMU.EDU), Machine Learning Department, Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1: Distributed training for DNTs (a hedged sketch of such a data-parallel loop appears after this table) |
| Open Source Code | No | The paper mentions building on the 'Petuum parameter server (Ho et al., 2013; Dai et al., 2015) from petuum.org', but provides no statement or link indicating that its specific DNTs implementation is open-sourced. |
| Open Datasets | Yes | Datasets: the paper uses three public corpora for evaluation. PubMed: 8,400,000 PubMed abstracts; the vocabulary is pruned to 70,000 words, and since no time stamps are associated, the whole dataset is treated as one epoch. PNAS: 79,800 paper titles from the Proceedings of the National Academy of Sciences, 1915-2005; the vocabulary size is 36,901, and the titles are grouped into 10 ten-year epochs (see the epoch-grouping sketch after this table). NIPS: 1,740 documents from the Proceedings of NIPS, 1988-1999; the vocabulary size is 13,649, and the documents are grouped into 12 one-year epochs. |
| Dataset Splits | No | The paper states 'The y-axis represents the per-document heldout likelihood (on 10% heldout test set)', indicating a test split (illustrated after this table), but does not explicitly mention a separate validation set or the sizes of the splits. |
| Hardware Specification | Yes | All experiments were run on a compute cluster where each machine has 16 cores and 128GB RAM, connected via 1Gbps ethernet. |
| Software Dependencies | No | The paper mentions 'We use the Petuum parameter server (Ho et al., 2013; Dai et al., 2015) from petuum.org.' but does not provide version numbers for Petuum or any other software dependency, so the software environment cannot be reproduced exactly. |
| Experiment Setup | Yes | For all the experiments, we set κ0 = 100, κ1 = κ2 = 1, and used stick-breaking parameters α = γ = 0.5. We estimate each node's vMF concentration parameter β from the data according to (Banerjee et al., 2005) (see the sketch after this table). The staleness s for bounded-asynchronous data parallelism is set to 0, which means workers always get up-to-date global parameters from the PS. |
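
The card only records that Algorithm 1 exists; for intuition, here is a minimal, hypothetical sketch of the kind of data-parallel variational loop it describes, with workers pulling fresh global parameters (staleness 0) from a shared server and pushing back updates. `ParameterServer`, `worker`, and `local_vi_update` are our illustrative names, not the paper's code; the toy server only mimics a parameter-server interface in spirit.

```python
# Hypothetical sketch of data-parallel variational training with a
# parameter server; the actual Algorithm 1 is not reproduced in this card.
import threading
from concurrent.futures import ThreadPoolExecutor

class ParameterServer:
    """Toy in-memory stand-in for a parameter server with staleness 0."""
    def __init__(self, init_params):
        self._params = dict(init_params)
        self._lock = threading.Lock()

    def pull(self):
        with self._lock:
            return dict(self._params)  # staleness 0: always-fresh reads

    def push(self, deltas):
        with self._lock:
            for k, v in deltas.items():
                self._params[k] = self._params.get(k, 0.0) + v

def local_vi_update(params, shard):
    # Placeholder: a real implementation would update per-document
    # variational distributions and return global parameter deltas.
    return {"dummy": 0.0}

def worker(ps, shard, n_iters):
    for _ in range(n_iters):
        params = ps.pull()                       # fetch global parameters
        deltas = local_vi_update(params, shard)  # hypothetical local VI step
        ps.push(deltas)                          # send updates back

ps = ParameterServer({"dummy": 0.0})
shards = [[], [], [], []]  # document shards, one per worker
with ThreadPoolExecutor(max_workers=len(shards)) as ex:
    for s in shards:
        ex.submit(worker, ps, s, 10)
```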
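The PNAS and NIPS corpora are grouped into fixed-width epochs (ten-year and one-year, respectively). A minimal sketch of that grouping, assuming each document carries a year; the `(year, title)` representation and helper name are illustrative:

```python
# Group timestamped documents into fixed-width epochs, as described
# for the PNAS corpus (ten-year epochs, 1915-2005).
from collections import defaultdict

def group_into_epochs(docs, start_year, epoch_len):
    """Map each (year, title) pair to an epoch index of width `epoch_len`."""
    epochs = defaultdict(list)
    for year, title in docs:
        epochs[(year - start_year) // epoch_len].append(title)
    return dict(epochs)

docs = [(1916, "On the theory of ..."), (1938, "A note on ..."), (2004, "Genome ...")]
print(group_into_epochs(docs, start_year=1915, epoch_len=10))
# -> epochs 0, 2, and 8 of the ten ten-year epochs
```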
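The heldout evaluation uses a 10% test set, but the paper does not describe how the split is drawn; the following random document-level split (with an arbitrary seed) is therefore only an assumption:

```python
# Sketch of a 90/10 heldout split at the document level.
import random

def heldout_split(doc_ids, test_frac=0.10, seed=0):
    rng = random.Random(seed)
    ids = list(doc_ids)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_frac)
    return ids[n_test:], ids[:n_test]  # (train, test)

train, test = heldout_split(range(1740))  # e.g., the NIPS corpus size
print(len(train), len(test))  # 1566 174
```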
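Finally, the reported hyperparameters can be collected into one configuration, and the cited Banerjee et al. (2005) estimator for a vMF concentration parameter has a standard closed-form approximation, κ̂ = r̄(d − r̄²)/(1 − r̄²), where r̄ is the mean resultant length of the unit-norm data. The config keys and the `estimate_vmf_kappa` helper below are our naming; whether the paper uses exactly this variant of the approximation is not stated in the card.

```python
# Reported hyperparameters as a config dict, plus the standard
# Banerjee et al. (2005) approximation for a vMF concentration parameter.
import numpy as np

CONFIG = {
    "kappa0": 100.0,   # kappa_0
    "kappa1": 1.0,     # kappa_1
    "kappa2": 1.0,     # kappa_2
    "alpha": 0.5,      # stick-breaking parameter
    "gamma": 0.5,      # stick-breaking parameter
    "staleness": 0,    # bounded-async staleness; 0 = always-fresh reads
}

def estimate_vmf_kappa(X):
    """Closed-form kappa estimate from (n, d) unit-norm observations."""
    n, d = X.shape
    r_bar = np.linalg.norm(X.sum(axis=0)) / n   # mean resultant length
    return r_bar * (d - r_bar ** 2) / (1 - r_bar ** 2)

# Example: tightly clustered unit vectors yield a large kappa.
rng = np.random.default_rng(0)
X = rng.normal([5.0, 0.0, 0.0], 0.1, size=(100, 3))
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(estimate_vmf_kappa(X))
```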