On Dataless Hierarchical Text Classification
Authors: Yangqiu Song, Dan Roth
AAAI 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments are designed to study the effectiveness of dataless hierarchical classification in comparison to standard supervised classification algorithms, and to study the contribution of different semantic representations to the success of the dataless scheme. Our results show that bootstrapped dataless classification is competitive with supervised classification with thousands of labeled examples. (Table 1: Comparing supervised and dataless hierarchical text classification on the 20NG dataset.) |
| Researcher Affiliation | Academia | Yangqiu Song and Dan Roth Department of Computer Science University of Illinois at Urbana-Champaign {yqsong,danr}@illinois.edu |
| Pseudocode | Yes | Algorithm 1 Top-down Pure Dataless HC. Algorithm 2 Bottom-up Pure Dataless HC. |
| Open Source Code | No | The paper references third-party open-source tools and datasets, such as Liblinear, Senna neural network word embedding, and Mikolov’s tool, but does not provide any statement or link indicating that the authors’ own implementation code for their methodology is open-source or publicly available. |
| Open Datasets | Yes | 20 Newsgroups Data (20NG): The 20 newsgroups data (Lang 1995) is usually used as a multi-class classification benchmark dataset. It contains about 20,000 newsgroup messages evenly distributed across 20 newsgroups (http://qwone.com/~jason/20Newsgroups/). RCV1 Data: The RCV1 dataset is an archive of manually labeled newswire stories from Reuters Ltd (Lewis et al. 2004). |
| Dataset Splits | Yes | For 20NG data, we randomly sample 50% of the document set and allow the bootstrapping process to access it, and we use the rest as test data. For RCV1, bootstrapping can access 80% of the documents for training, for compatibility with the supervised methods. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software tools like LBJava, Liblinear, Senna, and Mikolov’s tools, but does not specify their version numbers, only citing the papers that introduced them. |
| Experiment Setup | Yes | The threshold δ shown in the algorithms is empirically set to 0.95. Top-K labels are selected at each level; we empirically set N = 20 for both datasets. |
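The Pseudocode row references the paper's Algorithm 1 (Top-down Pure Dataless HC), which assigns labels by comparing a document's semantic representation to the representations of label descriptions, descending the hierarchy greedily. The sketch below is an illustrative reconstruction, not the authors' code: `embed` stands in for the paper's semantic representation (e.g., ESA) with a plain bag-of-words vector, and `delta` is used here as a relative similarity threshold for retaining sibling labels, an assumption about how δ = 0.95 is applied.

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def embed(text):
    # Stand-in for the paper's semantic representation (e.g., ESA);
    # here just a bag-of-words count vector.
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0.0) + 1.0
    return vec

def top_down_dataless(doc, hierarchy, label_texts, root, delta=0.95):
    """Greedy top-down dataless assignment (illustrative sketch).

    hierarchy:   dict mapping a label to its child labels
    label_texts: dict mapping a label to a textual description
    delta:       relative threshold for keeping sibling labels
    """
    d = embed(doc)
    assigned, node = [], root
    while hierarchy.get(node):
        scores = {c: cosine(d, embed(label_texts[c])) for c in hierarchy[node]}
        best = max(scores, key=scores.get)
        # Keep every child whose similarity is within delta of the best,
        # then descend into the best-matching child.
        assigned.extend(c for c, s in scores.items()
                        if s >= delta * scores[best] and s > 0.0)
        node = best
    return assigned

# Toy usage with a two-level hierarchy (hypothetical labels).
hierarchy = {"root": ["sports", "politics"],
             "sports": ["baseball", "hockey"], "politics": []}
label_texts = {"sports": "sports game team",
               "politics": "government election",
               "baseball": "baseball bat pitcher",
               "hockey": "hockey ice puck"}
doc = "the baseball team won the game with a great pitcher"
print(top_down_dataless(doc, hierarchy, label_texts, "root"))
# → ['sports', 'baseball']
```

The bottom-up variant (Algorithm 2 in the paper) instead scores the leaf labels first and propagates evidence upward; the same similarity machinery applies.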