On Dataless Hierarchical Text Classification
Authors: Yangqiu Song, Dan Roth
AAAI 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments are designed to study the effectiveness of dataless hierarchical classification in comparison to standard supervised classification algorithms, and to study the contribution of different semantic representations to the success of the dataless scheme. Our results show that bootstrapped dataless classification is competitive with supervised classification with thousands of labeled examples. (Table 1: Comparing supervised and dataless hierarchical text classification on the 20NG dataset.) |
| Researcher Affiliation | Academia | Yangqiu Song and Dan Roth Department of Computer Science University of Illinois at Urbana-Champaign {yqsong,danr}@illinois.edu |
| Pseudocode | Yes | Algorithm 1 Top-down Pure Dataless HC. Algorithm 2 Bottom-up Pure Dataless HC. |
| Open Source Code | No | The paper references third-party open-source tools and datasets, such as Liblinear, Senna neural network word embedding, and Mikolov’s tool, but does not provide any statement or link indicating that the authors’ own implementation code for their methodology is open-source or publicly available. |
| Open Datasets | Yes | 20 Newsgroups Data (20NG): The 20 newsgroups data (Lang 1995) is usually used as a multi-class classification benchmark dataset. It contains about 20,000 newsgroup messages evenly distributed across 20 newsgroups (http://qwone.com/~jason/20Newsgroups/). RCV1 Data: The RCV1 dataset is an archive of manually labeled newswire stories from Reuters Ltd (Lewis et al. 2004). |
| Dataset Splits | Yes | For 20NG data, we randomly sample 50% of the document set and allow the bootstrapping process to access it, and we use the rest as test data. For RCV1, bootstrapping can access 80% of the documents for training, for compatibility with the supervised methods. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions software tools like LBJava, Liblinear, Senna, and Mikolov’s tools, but does not specify their version numbers, only citing the papers that introduced them. |
| Experiment Setup | Yes | The threshold δ shown in the algorithms is empirically set to 0.95. Top-K labels are selected at each level; we empirically set N = 20 for both datasets. |
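The Pseudocode row references the paper's Algorithm 1 (Top-down Pure Dataless HC), which assigns labels by comparing a document's semantic representation to the representations of label descriptions, descending the hierarchy greedily. The sketch below is an illustrative reconstruction, not the authors' code: `embed` stands in for the paper's semantic representation (e.g., ESA) with a plain bag-of-words vector, and `delta` is used here as a relative similarity threshold for retaining sibling labels, an assumption about how δ = 0.95 is applied.

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two sparse vectors stored as dicts.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = sqrt(sum(x * x for x in u.values()))
    nv = sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def embed(text):
    # Stand-in for the paper's semantic representation (e.g., ESA);
    # here just a bag-of-words count vector.
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0.0) + 1.0
    return vec

def top_down_dataless(doc, hierarchy, label_texts, root, delta=0.95):
    """Greedy top-down dataless assignment (illustrative sketch).

    hierarchy:   dict mapping a label to its child labels
    label_texts: dict mapping a label to a textual description
    delta:       relative threshold for keeping sibling labels
    """
    d = embed(doc)
    assigned, node = [], root
    while hierarchy.get(node):
        scores = {c: cosine(d, embed(label_texts[c])) for c in hierarchy[node]}
        best = max(scores, key=scores.get)
        # Keep every child whose similarity is within delta of the best,
        # then descend into the best-matching child.
        assigned.extend(c for c, s in scores.items()
                        if s >= delta * scores[best] and s > 0.0)
        node = best
    return assigned

# Toy usage with a two-level hierarchy (hypothetical labels).
hierarchy = {"root": ["sports", "politics"],
             "sports": ["baseball", "hockey"], "politics": []}
label_texts = {"sports": "sports game team",
               "politics": "government election",
               "baseball": "baseball bat pitcher",
               "hockey": "hockey ice puck"}
doc = "the baseball team won the game with a great pitcher"
print(top_down_dataless(doc, hierarchy, label_texts, "root"))
# → ['sports', 'baseball']
```

The bottom-up variant (Algorithm 2 in the paper) instead scores the leaf labels first and propagates evidence upward; the same similarity machinery applies.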