Dependent nonparametric trees for dynamic hierarchical clustering
Authors: Kumar Avinava Dubey, Qirong Ho, Sinead A Williamson, Eric P Xing
NeurIPS 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper states: 'We demonstrate the efficacy of our model and inference algorithm on both synthetic data and real-world document corpora.' |
| Researcher Affiliation | Academia | Avinava Dubey, Qirong Ho, Sinead Williamson, Eric P. Xing (Machine Learning Department, Carnegie Mellon University; Institute for Infocomm Research, A*STAR; McCombs School of Business, University of Texas at Austin). Emails: akdubey@cs.cmu.edu, hoqirong@gmail.com, sinead.williamson@mccombs.utexas.edu, epxing@cs.cmu.edu |
| Pseudocode | No | The paper describes the online learning algorithm in Section 4 but does not present it as structured pseudocode or an algorithm block. |
| Open Source Code | No | The paper does not contain any statements or links indicating that source code for the described methodology is publicly available. |
| Open Datasets | No | The paper describes the TWITTER, PNAS, and STATE OF THE UNION (SOU) datasets, including their origin (e.g., 'Proceedings of the National Academy of Sciences', 'Presidential SoU addresses'), but does not provide specific links, DOIs, repositories, or formal citations for these processed datasets to confirm public availability. |
| Dataset Splits | No | The paper states: 'We use the first 50% to tune model parameters and select a good random restart (by training on 90% and testing on 10% of the data at each time point), and then use the last 50% to evaluate the performance of the best parameters/restart (again, by training on 90% and testing on 10% data).' While a portion of the data is used for tuning, it is described as 'testing' data rather than a distinct 'validation' set. |
| Hardware Specification | No | The paper mentions 'Every dTSSBP trial completed in < 20 minutes on a single processor core, while we observed moderate (though not perfectly linear) speedups with 2-4 processors,' but does not provide specific hardware details such as CPU models, GPU types, or memory specifications. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers, such as programming languages, libraries, or solvers. |
| Experiment Setup | Yes | When training the 3 TSSBP-based models, we grid-searched κ0 ∈ {1, 10, 100, 1000, 10000}, and fixed κ1 = 1, a = 0 for simplicity. Each value of κ0 was run 5 times to get different random restarts, and we took the best κ0-restart pair for evaluation on the last 50% of time points. For the 3 DP-based models, there is no κ0 parameter, so we simply took 5 random restarts and used the best one for evaluation. For all TSSBP- and DP-based models, we repeated the evaluation phase 5 times to get error bars. For all models, we estimated each node/cluster's vMF concentration parameter β from the data. For the TSSBP-based models, we used stick-breaking parameters γ = 0.5 and α(d) = 0.5^d, and set θ1(t) to the average document term frequency vector at time t. In order to keep running times reasonable, we limit the TSSBP-based models to a maximum depth of either 3 or 4 (we report results for both). For the DP-based models, we used a Dirichlet process concentration parameter of 1. The dDP's inter-epoch vMF concentration parameter was set to ξ = 0.001. |
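The tuning and evaluation protocol quoted in the Dataset Splits and Experiment Setup rows (50%/50% split of time points, 90%/10% train/test at each time point, a grid search over κ0 with 5 random restarts) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation (which is not publicly released); `fit_model`, `score`, and the seeding convention are hypothetical placeholders.

```python
import itertools
import random

# Sketch of the tuning/evaluation protocol described above.
# fit_model and score are placeholder callables, not the authors' code.

KAPPA0_GRID = [1, 10, 100, 1000, 10000]  # grid-searched values of kappa_0
N_RESTARTS = 5                           # random restarts per kappa_0 value

def split_time_points(time_points):
    """First 50% of time points for tuning, last 50% for final evaluation."""
    half = len(time_points) // 2
    return time_points[:half], time_points[half:]

def train_test_split_docs(docs, test_frac=0.1, seed=0):
    """At each time point, train on 90% of documents and test on 10%."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    cut = int(round(len(docs) * (1 - test_frac)))
    return docs[:cut], docs[cut:]

def tune(tuning_points, fit_model, score):
    """Grid-search (kappa_0, restart) pairs; keep the best-scoring pair."""
    best = None
    for kappa0, restart in itertools.product(KAPPA0_GRID, range(N_RESTARTS)):
        s = score(fit_model(kappa0, seed=restart), tuning_points)
        if best is None or s < best[0]:
            best = (s, kappa0, restart)
    return best  # (score, kappa0, restart)
```

In this sketch the selected (κ0, restart) pair would then be re-evaluated 5 times on the held-out second half of the time points to produce error bars, matching the quoted setup.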