Spectral Label Refinement for Noisy and Missing Text Labels

Authors: Yangqiu Song, Chenguang Wang, Ming Zhang, Hailong Sun, Qiang Yang

AAAI 2015

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the effectiveness of the label refining algorithm on eight labeled document datasets, and validate that the results are useful for generating better labels. Experiments conducted on eight real world datasets have shown its power in the following three aspects.
Researcher Affiliation | Academia | Yangqiu Song (a), Chenguang Wang (b), Ming Zhang (b), Hailong Sun (c), Qiang Yang (d). (a) University of Illinois at Urbana-Champaign; (b) Peking University; (c) Beihang University; (d) Hong Kong University of Science and Technology.
Pseudocode | Yes | Algorithm 1 DLSR-based Label Refinement Algorithm
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper.
Open Datasets | Yes | To evaluate our algorithm, we use eight text classification datasets that contain the ground truth labels. Specifically, we use the datasets presented in (Zhong and Ghosh 2005), which are the 20-newsgroups data and the sets from the CLUTO toolkit (Karypis 2002). Eight subsets are selected to test our algorithm, which are summarized in Table 1. The ohscal dataset is from the OHSUMED collection (Hersh et al. 1994). Datasets tr11, tr12, tr23, tr31, tr41 and tr45 are from TREC collections.
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning into train/validation/test sets.
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments.
Software Dependencies | No | The paper mentions 'Bow toolkit (McCallum 1996)' and 'CLUTO toolkit (Karypis 2002)' but does not provide specific version numbers for these or any other ancillary software components.
Experiment Setup | Yes | For example, the noise rate 40% represents that we randomly select 40% of the true labels and randomly permute these labels. Here, we set the noise rates as 0%, 20%, 40% and 60%. We set a = 1 and b = 0.001 (defined in Definition 3) for this experiment. All the data are computed using normalized TF-IDF features. The neighborhood number to construct the content based neighborhood graphs for all the graph based algorithms is empirically set to 10.
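
The noise-injection step described in the experiment setup can be illustrated with a short sketch. This is not the authors' code (the paper releases none); the function name `inject_label_noise`, the `seed` parameter, and the rounding of the selected count are assumptions made only for illustration. It picks a random fraction of positions and shuffles the labels among those positions, which is one plausible reading of "randomly select 40% of the true labels and randomly permute these labels".

```python
import random


def inject_label_noise(labels, noise_rate, seed=0):
    """Return a copy of `labels` where a random `noise_rate` fraction of
    positions have had their labels shuffled among themselves.

    Hypothetical helper: the name, the `seed` argument, and the rounding
    rule are illustrative assumptions, not taken from the paper.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)
    noisy = list(labels)
    n_selected = int(round(noise_rate * len(noisy)))
    # Choose which positions are subject to corruption.
    selected = rng.sample(range(len(noisy)), n_selected)
    # Permute the labels among the selected positions only.
    permuted = [noisy[i] for i in selected]
    rng.shuffle(permuted)
    for idx, label in zip(selected, permuted):
        noisy[idx] = label
    return noisy


if __name__ == "__main__":
    true_labels = [0, 1, 2, 3] * 25
    # Noise rates reported in the setup: 0%, 20%, 40% and 60%.
    for rate in (0.0, 0.2, 0.4, 0.6):
        noisy_labels = inject_label_noise(true_labels, rate, seed=1)
        changed = sum(a != b for a, b in zip(true_labels, noisy_labels))
        print(f"noise rate {rate:.0%}: {changed} labels changed")
```

Under this reading, a permuted label can land on a position that already carried the same class, so the fraction of labels that actually change is at most the nominal noise rate and is usually somewhat lower.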