Spectral Label Refinement for Noisy and Missing Text Labels
Authors: Yangqiu Song, Chenguang Wang, Ming Zhang, Hailong Sun, Qiang Yang
AAAI 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of the label refining algorithm on eight labeled document datasets, and validate that the results are useful for generating better labels. Experiments conducted on eight real-world datasets have shown its power in the following three aspects. |
| Researcher Affiliation | Academia | Yangqiu Song (a), Chenguang Wang (b), Ming Zhang (b), Hailong Sun (c), Qiang Yang (d). (a) University of Illinois at Urbana-Champaign; (b) Peking University; (c) Beihang University; (d) Hong Kong University of Science and Technology |
| Pseudocode | Yes | Algorithm 1 DLSR-based Label Refinement Algorithm |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described in this paper. |
| Open Datasets | Yes | To evaluate our algorithm, we use eight text classification datasets that contain the ground truth labels. Specifically, we use the datasets presented in (Zhong and Ghosh 2005), which are the 20-newsgroups data and the sets from the CLUTO toolkit (Karypis 2002). Eight subsets are selected to test our algorithm, which are summarized in Table 1. The ohscal dataset is from the OHSUMED collection (Hersh et al. 1994). Datasets tr11, tr12, tr23, tr31, tr41 and tr45 are from TREC collections. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) needed to reproduce the data partitioning into train/validation/test sets. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions 'Bow toolkit (McCallum 1996)' and 'CLUTO toolkit (Karypis 2002)' but does not provide specific version numbers for these or any other ancillary software components. |
| Experiment Setup | Yes | For example, a noise rate of 40% means that we randomly select 40% of the true labels and randomly permute these labels. Here, we set the noise rates as 0%, 20%, 40% and 60%. We set a = 1 and b = 0.001 (defined in Definition 3) for this experiment. All the data are computed using normalized TF-IDF features. The neighborhood number to construct the content-based neighborhood graphs for all the graph-based algorithms is empirically set to 10. |
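
The label-noise procedure quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration and not the authors' code, which is not released: the function name `inject_label_noise`, the `seed` parameter, and the rounding of the number of selected labels are all assumptions. It reads "randomly permute these labels" as shuffling the selected labels among the selected positions, so some selected documents may keep their original label and the effective error rate can be lower than the nominal noise rate.

```python
import random


def inject_label_noise(labels, noise_rate, seed=0):
    """Return (noisy_labels, chosen_positions).

    A `noise_rate` fraction of positions is picked at random and the
    labels at those positions are randomly permuted among themselves.
    Hypothetical helper: names and rounding are illustrative assumptions.
    """
    if not 0.0 <= noise_rate <= 1.0:
        raise ValueError("noise_rate must be in [0, 1]")
    rng = random.Random(seed)
    n = len(labels)
    k = int(round(noise_rate * n))            # number of labels to perturb
    chosen = rng.sample(range(n), k)          # positions selected for noise
    shuffled = [labels[i] for i in chosen]
    rng.shuffle(shuffled)                     # random permutation of those labels
    noisy = list(labels)                      # copy; the input is left untouched
    for pos, lab in zip(chosen, shuffled):
        noisy[pos] = lab
    return noisy, sorted(chosen)


if __name__ == "__main__":
    true_labels = [0] * 50 + [1] * 30 + [2] * 20
    # Noise rates reported in the experiment setup.
    for rate in (0.0, 0.2, 0.4, 0.6):
        noisy, chosen = inject_label_noise(true_labels, rate, seed=42)
        changed = sum(a != b for a, b in zip(true_labels, noisy))
        print(f"rate={rate:.1f} selected={len(chosen)} actually_changed={changed}")
```

Because the selected labels are only permuted, the overall class distribution is unchanged. A variant that resamples each selected label uniformly from the other classes would change that distribution, and the quoted text does not say which reading the authors used.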