Quality-Based Learning for Web Data Classification
Authors: Ou Wu, Ruiguang Hu, Xue Mao, Weiming Hu
AAAI 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results indicate that the quality-related factors are useful in web data classification, and that the proposed method outperforms conventional algorithms that do not consider information quantity and quality. |
| Researcher Affiliation | Academia | Ou Wu, Ruiguang Hu, Xue Mao, Weiming Hu National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. {wuou, rghu, xmao, wmhu}@nlpr.ia.ac.cn |
| Pseudocode | Yes | Algorithm 1 Learning (and testing) based on the soft clustering for quality-related factors (LQSC). Input: training data (X, Y) and associated quality-related factors Q; a test sample x_t and its quality-related factor q_t; M, T. Initialize: W(0), B(0). Steps: 1. Cluster the quality-related factors Q into M groups using a GMM; 2. Calculate P_i for each training sample using Eq. (10); 3. Learn the feature weights W by iteratively updating W and B via Eqs. (24) and (25) until the maximum number of iterations T is reached or the iteration converges; 4. Select features according to W; 5. Learn M classifiers with the selected features, one per training subset, by solving (26); 6. Calculate the probability vector P(q_t) using Eq. (10); 7. Calculate the new feature vector x′_t of x_t based on W; 8. Classify x′_t using the M classifiers, P(q_t), and Eq. (11). Output: the GMM over the M clusters of quality-related factors, the M classifiers, and the predicted label of x_t. (A runnable sketch of this pipeline appears below the table.) |
| Open Source Code | No | The paper does not provide any links to open-source code for the methodology it describes, nor does it explicitly state that its code is being released. |
| Open Datasets | Yes | The dataset consisting of 4427 normal and cannabis web pages from (Wang et al., 2011) is used. The image data introduced in (Zuo et al., 2010) is also used. |
| Dataset Splits | Yes | The parameters C and g are searched via five-fold cross-validation in {0.1, 1, 10, 50, 100} and {0.001, 0.01, 0.1, 1, 10}, respectively. The corresponding data subset for each cluster is randomly split into two equal parts: one part is used for training and the other for testing. The random split is repeated 10 times. (See the split sketch after the table.) |
| Hardware Specification | No | The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper mentions using SVM, Random Forest (RF), K-means, and Gaussian Mixture Model (GMM), but it does not provide specific version numbers for any software dependencies, libraries, or programming environments. |
| Experiment Setup | Yes | The radial basis kernel is chosen for both SVM and LQW. The parameters C and g are searched via five-fold cross-validation in {0.1, 1, 10, 50, 100} and {0.001, 0.01, 0.1, 1, 10}, respectively. For RF, only the number of trees is varied in {10, 50, 100, 200, 300}; the other parameters are left at their defaults. The parameter γ in LQHC and LQSC is searched in {0.0001, 0.001, 0.01, 0.1, 1}. The maximum number of iterations used in LQSC is set to 20. In both LQHC and LQSC, the number of clusters M is set to 3. In the experiments, K is set to 50 (for image features). (A grid-search sketch follows below.) |
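
Algorithm 1 compresses to a short pipeline: soft-cluster the quality-related factors with a GMM, train one classifier per cluster, and combine the classifiers' posteriors weighted by the test sample's cluster memberships. Below is a minimal Python sketch using scikit-learn's `GaussianMixture` and `SVC`; it omits the iterative feature-weight learning of steps 3-4 (Eqs. 24-25), and treating the soft memberships as per-classifier sample weights is an assumption about the form of Eq. (26), not a reproduction of it.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def lqsc_fit(X, y, Q, M=3):
    """Steps 1-2 and 5: soft-cluster the quality-related factors Q with a
    GMM, then learn one RBF-SVM per cluster, weighting every training
    sample by its membership probability for that cluster."""
    gmm = GaussianMixture(n_components=M, random_state=0).fit(Q)
    P = gmm.predict_proba(Q)  # soft memberships P_i (Eq. 10 analogue)
    clfs = []
    for m in range(M):
        clf = SVC(kernel="rbf", probability=True)
        clf.fit(X, y, sample_weight=P[:, m] + 1e-8)  # keep weights nonzero
        clfs.append(clf)
    return gmm, clfs

def lqsc_predict(gmm, clfs, x_t, q_t):
    """Steps 6-8: combine the M classifiers' class posteriors, weighted by
    the test sample's membership vector P(q_t) (Eq. 11 analogue)."""
    p_q = gmm.predict_proba(q_t.reshape(1, -1)).ravel()
    posts = np.vstack([clf.predict_proba(x_t.reshape(1, -1)).ravel()
                       for clf in clfs])
    return clfs[0].classes_[int((p_q @ posts).argmax())]
```

All M classifiers are fit on the same (X, y) with different weights, so their `classes_` orderings agree and the weighted posteriors can be combined directly.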
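
The split protocol in the "Dataset Splits" row (each cluster's subset halved at random, repeated 10 times) maps onto scikit-learn's `ShuffleSplit`. The sketch below assumes mean test accuracy is the aggregate reported over the ten repetitions, which the row does not state explicitly.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

def repeated_half_split_score(clf, X_cluster, y_cluster):
    """Ten random 50/50 train/test splits of one cluster's data subset,
    returning the mean test accuracy over the repetitions."""
    splitter = ShuffleSplit(n_splits=10, test_size=0.5, random_state=0)
    scores = []
    for train_idx, test_idx in splitter.split(X_cluster):
        clf.fit(X_cluster[train_idx], y_cluster[train_idx])
        scores.append(clf.score(X_cluster[test_idx], y_cluster[test_idx]))
    return float(np.mean(scores))
```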
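
The parameter search in the "Experiment Setup" row translates directly into a grid search with five-fold cross-validation. The paper names no software, so the use of scikit-learn's `GridSearchCV` here is an assumption; the grids themselves are the ones reported.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# RBF-kernel SVM: C and g (gamma) searched via five-fold cross-validation.
svm_search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 50, 100],
                "gamma": [0.001, 0.01, 0.1, 1, 10]},
    cv=5,
)

# RF: only the number of trees is varied; other parameters stay at defaults.
rf_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"n_estimators": [10, 50, 100, 200, 300]},
    cv=5,
)

# Usage (X_train and y_train are placeholders):
# svm_search.fit(X_train, y_train); print(svm_search.best_params_)
# rf_search.fit(X_train, y_train); print(rf_search.best_params_)
```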