Quality-Based Learning for Web Data Classification

Authors: Ou Wu, Ruiguang Hu, Xue Mao, Weiming Hu

AAAI 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental
LLM Response: Experimental results indicate that the quality-related factors are useful in web data classification, and the proposed method outperforms conventional algorithms that do not consider information quantity and quality. (Evidence drawn from the paper's "Experiments" and "Experimental setup" sections.)
Researcher Affiliation | Academia
LLM Response: Ou Wu, Ruiguang Hu, Xue Mao, Weiming Hu. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. {wuou, rghu, xmao, wmhu}@nlpr.ia.ac.cn
Pseudocode | Yes
LLM Response: Algorithm 1, "Learning (and testing) based on the soft clustering for quality-related factors (LQSC)".
Input: training data (X, Y) and associated quality-related factors Q; a test sample x_t and its quality-related factor q_t; M, T.
Initialize: W^(0), B^(0).
Steps:
1. Cluster the quality-related factors Q into M groups using a GMM;
2. Calculate P_i for each training sample using Eq. (10);
3. Learn the feature weights W by iteratively updating W and B using Eqs. (24) and (25) until the maximum number of iterations (T) is reached or the iteration converges;
4. Select features according to W;
5. Learn M classifiers with the selected features, one for each training subset, by solving (26);
6. Calculate the probability vector P(q_t) using Eq. (10);
7. Calculate the new feature vector x̃_t of x_t based on W;
8. Classify x̃_t using the M classifiers, P(q_t), and Eq. (11).
Output: the GMM of all M clusters of quality-related factors, the M classifiers, and the predicted label of x_t. (A Python sketch of this pipeline appears below.)
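A minimal sketch of the LQSC pipeline above, assuming scikit-learn's GaussianMixture and SVC as stand-ins. It covers steps 1, 2, and 5 through 8 only: the paper's joint feature-weight learning (Eqs. 24 and 25) and subproblem (26) are omitted, and each training sample is hard-assigned to its most probable cluster rather than softly weighted. The function names `train_lqsc` and `predict_lqsc` are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def train_lqsc(X, y, Q, M=3):
    """X: feature matrix, y: labels, Q: quality-related factors (2-D array)."""
    # Step 1: soft-cluster the quality-related factors with a GMM.
    gmm = GaussianMixture(n_components=M, random_state=0).fit(Q)
    # Step 2: cluster membership probabilities, standing in for Eq. (10).
    P = gmm.predict_proba(Q)
    hard = P.argmax(axis=1)
    # Step 5: one classifier per cluster. The paper softly weights samples and
    # first learns feature weights W; here features are used as-is.
    clfs = [SVC(kernel="rbf", probability=True).fit(X[hard == m], y[hard == m])
            for m in range(M)]
    return gmm, clfs

def predict_lqsc(gmm, clfs, x_t, q_t):
    # Steps 6-8: combine the M classifiers weighted by P(q_t), as in Eq. (11).
    p_q = gmm.predict_proba(q_t.reshape(1, -1))[0]
    probs = np.stack([c.predict_proba(x_t.reshape(1, -1))[0] for c in clfs])
    # Assumes every cluster's training subset contains all classes, so the
    # per-classifier class orderings agree.
    return clfs[0].classes_[(p_q[:, None] * probs).sum(axis=0).argmax()]
```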
Open Source Code | No
LLM Response: The paper provides no links to open-source code for the methodology it describes, nor does it explicitly state that its code is being released.
Open Datasets | Yes
LLM Response: The data set consisting of 4427 normal and cannabis web pages from (Wang et al., 2011) is used. The image data introduced in (Zuo et al., 2010) is also applied.
Dataset Splits | Yes
LLM Response: The parameters C and g are searched via five-fold cross-validation over {0.1, 1, 10, 50, 100} and {0.001, 0.01, 0.1, 1, 10}, respectively. The data subset corresponding to each cluster is randomly split into two equal parts: one part is used for training and the other for testing. The random split is repeated 10 times. (A sketch of this protocol follows.)
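A short sketch of the reported evaluation protocol, assuming scikit-learn: a 50/50 random train/test split of one cluster's subset, repeated 10 times. `train_and_score` is a hypothetical callback that fits a model and returns test accuracy.

```python
from sklearn.model_selection import train_test_split

def repeated_holdout(X_cluster, y_cluster, train_and_score, n_repeats=10):
    """Average test accuracy over n_repeats random 50/50 splits."""
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_cluster, y_cluster, test_size=0.5, random_state=seed)
        scores.append(train_and_score(X_tr, y_tr, X_te, y_te))
    return sum(scores) / n_repeats  # mean over the 10 random splits
```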
Hardware Specification | No
LLM Response: The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models.
Software Dependencies | No
LLM Response: The paper mentions using SVM, Random Forest (RF), K-means, and Gaussian Mixture Models (GMM), but it does not provide version numbers for any software dependencies, libraries, or programming environments.
Experiment Setup | Yes
LLM Response: The radial basis kernel is chosen for both SVM and LQW. The parameters C and g are searched via five-fold cross-validation over {0.1, 1, 10, 50, 100} and {0.001, 0.01, 0.1, 1, 10}, respectively. For RF, only the number of trees is varied over {10, 50, 100, 200, 300}; the other parameters are left at their defaults. The parameter γ in LQHC and LQSC is searched over {0.0001, 0.001, 0.01, 0.1, 1}. The maximum number of iterations in LQSC is set to 20. In both LQHC and LQSC, the number of clusters M is set to 3. K is set to 50 (for image features). (A grid-search sketch follows.)
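A hedged sketch of the reported hyperparameter search, assuming scikit-learn: C and g (gamma) are selected by five-fold cross-validation over the grids quoted above, and the RF tree count is searched the same way. Only the grids stated in the paper are used; everything else is a plain default.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# RBF-kernel SVM: search C and gamma over the paper's grids with 5-fold CV.
svm_search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 50, 100],
                "gamma": [0.001, 0.01, 0.1, 1, 10]},
    cv=5)

# Random Forest: only the number of trees is varied; other parameters default.
rf_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"n_estimators": [10, 50, 100, 200, 300]},
    cv=5)

# Usage: svm_search.fit(X_train, y_train); svm_search.best_params_
```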