Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Large-scale Distributed Dependent Nonparametric Trees
Authors: Zhiting Hu, Qirong Ho, Avinava Dubey, Eric Xing
ICML 2015 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5. Experimental Results Our experiments show that (1) our distributed framework achieves near-linear (i.e. near-optimal) scalability with increasing number of cores/machines; (2) the DNTs system enables big tree models (10K nodes) on large data, and well captures long-tail topics. (3) the proposed VI algorithm achieves competitive heldout likelihood with MCMC, and discovers meaningful topic evolution. |
| Researcher Affiliation | Academia | Zhiting Hu EMAIL Language Technologies Institute, Carnegie Mellon University Qirong Ho EMAIL Institute for Infocomm Research, A*STAR Avinava Dubey EMAIL Eric P. Xing EMAIL Machine Learning Department, Carnegie Mellon University |
| Pseudocode | Yes | Algorithm 1 Distributed training for DNTs |
| Open Source Code | No | The paper mentions building on 'Petuum parameter server (Ho et al., 2013; Dai et al., 2015) from petuum.org.' but provides no statement or link indicating that the DNTs implementation, or the code for the described methodology, has been open-sourced. |
| Open Datasets | Yes | Datasets We use three public corpora for the evaluation: PubMed: 8,400,000 PubMed abstracts. The vocabulary is pruned to 70,000 words. Since no time stamp is associated, we treat the whole dataset as from 1 epoch. PNAS: 79,800 paper titles from the Proceedings of the National Academy of Sciences 1915-2005. The vocabulary size is 36,901. We grouped the titles into 10 ten-year epoches. NIPS: 1,740 documents from the Proceedings of the NIPS 1988-1999. The vocabulary size is 13,649. We grouped the documents into 12 one-year epoches. |
| Dataset Splits | No | The paper states 'The y-axis represents the per-document heldout likelihood (on 10% heldout test set)' indicating a test split, but does not explicitly mention a separate validation set or its split size. |
| Hardware Specification | Yes | All experiments were run on a compute cluster where each machine has 16 cores and 128GB RAM, connected via 1Gbps ethernet. |
| Software Dependencies | No | The paper mentions 'We use the Petuum parameter server (Ho et al., 2013; Dai et al., 2015) from petuum.org.' but does not provide specific version numbers for Petuum or any other software dependency, so the software environment cannot be reproduced exactly. |
| Experiment Setup | Yes | For all the experiments, we set κ0 = 100, κ1 = κ2 = 1, and used stick breaking parameters α = γ = 0.5. We estimate each node's vMF concentration parameter β from the data according to (Banerjee et al., 2005). The staleness s for bounded-asynchronous data parallelism is set to 0, which means workers always get up-to-date global parameters from the PS. |
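
The settings quoted in the Experiment Setup and Dataset Splits rows can be gathered into a small configuration object. This is a minimal sketch, not the authors' code: the class name `DNTConfig`, its field names, and the helper `can_proceed` are hypothetical. The staleness check assumes the usual bounded-staleness rule (a worker may run at most `s` clocks ahead of the slowest worker), which the table does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DNTConfig:
    """Values quoted in the table above; all field names are hypothetical."""
    kappa0: float = 100.0
    kappa1: float = 1.0
    kappa2: float = 1.0
    alpha: float = 0.5              # stick-breaking parameter
    gamma: float = 0.5              # stick-breaking parameter
    staleness: int = 0              # s = 0: workers always read up-to-date parameters
    heldout_fraction: float = 0.10  # 10% held-out test set


def can_proceed(worker_clock: int, slowest_clock: int, staleness: int) -> bool:
    """Assumed rule: a worker may be at most `staleness` clocks ahead of the slowest worker."""
    return worker_clock - slowest_clock <= staleness


cfg = DNTConfig()
# With s = 0 a worker level with the slowest one may advance ...
print(can_proceed(worker_clock=3, slowest_clock=3, staleness=cfg.staleness))
# ... but a worker one clock ahead must wait, i.e. fully synchronous updates.
print(can_proceed(worker_clock=4, slowest_clock=3, staleness=cfg.staleness))
```

Under this assumed rule, a staleness of 0 forces every worker to wait for the slowest one, which matches the row's statement that workers always see current global parameters.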