Feature Sampling Based Unsupervised Semantic Clustering for Real Web Multi-View Content

Authors: Xiaolong Gong, Linpeng Huang, Fuwei Wang

AAAI 2019

Reproducibility Variable Result LLM Response
Research Type Experimental Compared with some state-of-the-art methods, we demonstrate the effectiveness of our proposed method on a large real-world dataset Doucom and the other three smaller datasets. In this section, we conduct experiments to evaluate the effectiveness of FSUSC. In Table 4, we present results of all methods measured by ACC and NMI for each dataset.
Researcher Affiliation Academia Xiaolong Gong, Linpeng Huang, Fuwei Wang Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China {gxl121438, lphuang, wfwzy2012}@sjtu.edu.cn
Pseudocode Yes Algorithm 1: FSUSC algorithm
Open Source Code No The paper does not include any explicit statements about releasing source code or provide a link to a code repository for the methodology described.
Open Datasets Yes Doucom. This large-scale dataset is crawled from a famous web community called Douban (https://developers.douban.com/wiki/?title=api_v2); we collect four views for this dataset, including 31,297 summaries, 2,995,406 comments, 608,158 reviews and 461,358 users. ... Last.fm. This dataset consists of 9,694 items (artists)... (http://www.last.fm/api). Yelp. This dataset is a subset of the Yelp Challenge Dataset (YDC) (http://www.yelp.com/dataset_challenge), which includes 11,537 items (businesses) in total. ... 3-Sources. This text dataset was collected from three well-known online news sources... (http://mlg.ucd.ie/datasets)
Dataset Splits No The paper uses datasets for evaluation and mentions training ('We train models using...') but does not explicitly provide specific percentages, absolute counts, or detailed methodologies for train/validation/test splits, or cross-validation setup needed to reproduce the data partitioning. It only mentions initializing U(l) and V(l) using k-means 100 times on the combined data.
Hardware Specification No The paper does not provide any specific hardware details such as GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies No The paper mentions techniques like Latent Dirichlet Allocation and TF-IDF transformation, but it does not specify any programming languages, libraries, or solvers with their version numbers (e.g., Python 3.x, PyTorch 1.x, scikit-learn 0.x).
Experiment Setup Yes In this work, we have six major parameters: {σ, α, β, γ, η, K}. We empirically set σ = 0.01 in the kernel function. α, β, γ are the weight coefficients, which are set to 1, 2, 1, respectively. η is a reduction factor that controls the size of the feature subset; we set η = 8 in the final results and compare the computation time for different values of η at each iteration (see Fig. 1 and Fig. 3). We also compare the time consumption of the proposed algorithm against the necessary baseline models in Table 3. K is the reduced dimension, set equal to the number of clusters in each dataset.
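The review notes that the paper initializes U(l) and V(l) by running k-means 100 times on the combined data, and lists the hyperparameters {σ, α, β, γ, η, K}. A minimal sketch of what a reproduction attempt might look like is below; the function names (`kmeans`, `best_of_n_kmeans`) and the `PARAMS` dictionary are hypothetical illustrations assembled from the quoted values, not code from the paper.

```python
import numpy as np

# Hyperparameters quoted in the Experiment Setup row; K is dataset-specific
# (equal to the number of clusters), so it is left out of this dict.
PARAMS = {"sigma": 0.01, "alpha": 1, "beta": 2, "gamma": 1, "eta": 8}

def kmeans(X, k, n_iters=50, rng=None):
    """Plain Lloyd's k-means; returns (centers, labels, inertia)."""
    if rng is None:
        rng = np.random.default_rng()
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # Squared distances from every point to every center.
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Recompute centers; keep the old center if a cluster empties out.
        new = np.array([X[labels == j].mean(0) if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    inertia = ((X - centers[labels]) ** 2).sum()
    return centers, labels, inertia

def best_of_n_kmeans(X, k, n_restarts=100, seed=0):
    """Restart k-means n_restarts times (the paper reports 100) and keep
    the run with the lowest within-cluster sum of squares."""
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_restarts):
        result = kmeans(X, k, rng=rng)
        if best is None or result[2] < best[2]:
            best = result
    return best
```

How the best restart is selected (lowest inertia) is an assumption; the paper only states that k-means is run 100 times, not the selection criterion.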