Quality-Based Learning for Web Data Classification

Authors: Ou Wu, Ruiguang Hu, Xue Mao, Weiming Hu

AAAI 2014

Reproducibility Variable | Result | LLM Response
Research Type | Experimental
LLM Response: Experimental results indicate that the quality-related factors are useful in web data classification, and the proposed method outperforms conventional algorithms that do not consider information quantity and quality. (Evidence drawn from the paper's "Experiments" and "Experimental setup" sections.)
Researcher Affiliation | Academia
LLM Response: Ou Wu, Ruiguang Hu, Xue Mao, Weiming Hu. National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. {wuou, rghu, xmao, wmhu}@nlpr.ia.ac.cn
Pseudocode | Yes
LLM Response: Algorithm 1, "Learning (and testing) based on the soft clustering for quality-related factors (LQSC)".
Input: training data (X, Y) and associated quality-related factors Q; a test sample x_t and its quality-related factor q_t; M, T.
Initialize: W^(0), B^(0).
Steps:
1. Cluster the quality-related factors Q into M groups using a GMM;
2. Calculate P_i for each training sample using Eq. (10);
3. Learn the feature weights W by iteratively updating W and B using Eqs. (24) and (25) until the maximum number of iterations (T) is reached or the iteration converges;
4. Select features according to W;
5. Learn M classifiers with the selected features, one for each training subset, by solving (26);
6. Calculate the probability vector P(q_t) using Eq. (10);
7. Calculate the new feature vector x̃_t of x_t based on W;
8. Classify x̃_t using the M classifiers, P(q_t), and Eq. (11).
Output: the GMM of all M clusters of quality-related factors, the M classifiers, and the predicted label of x_t. (A Python sketch of this pipeline appears below.)
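A minimal sketch of the LQSC pipeline above, assuming scikit-learn's GaussianMixture and SVC as stand-ins. It covers steps 1, 2, and 5 through 8 only: the paper's joint feature-weight learning (Eqs. 24 and 25) and subproblem (26) are omitted, and each training sample is hard-assigned to its most probable cluster rather than softly weighted. The function names `train_lqsc` and `predict_lqsc` are hypothetical.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def train_lqsc(X, y, Q, M=3):
    """X: feature matrix, y: labels, Q: quality-related factors (2-D array)."""
    # Step 1: soft-cluster the quality-related factors with a GMM.
    gmm = GaussianMixture(n_components=M, random_state=0).fit(Q)
    # Step 2: cluster membership probabilities, standing in for Eq. (10).
    P = gmm.predict_proba(Q)
    hard = P.argmax(axis=1)
    # Step 5: one classifier per cluster. The paper softly weights samples and
    # first learns feature weights W; here features are used as-is.
    clfs = [SVC(kernel="rbf", probability=True).fit(X[hard == m], y[hard == m])
            for m in range(M)]
    return gmm, clfs

def predict_lqsc(gmm, clfs, x_t, q_t):
    # Steps 6-8: combine the M classifiers weighted by P(q_t), as in Eq. (11).
    p_q = gmm.predict_proba(q_t.reshape(1, -1))[0]
    probs = np.stack([c.predict_proba(x_t.reshape(1, -1))[0] for c in clfs])
    # Assumes every cluster's training subset contains all classes, so the
    # per-classifier class orderings agree.
    return clfs[0].classes_[(p_q[:, None] * probs).sum(axis=0).argmax()]
```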
Open Source Code | No
LLM Response: The paper provides no links to open-source code for the methodology it describes, nor does it explicitly state that its code is being released.
Open Datasets | Yes
LLM Response: The data set consisting of 4427 normal and cannabis web pages from (Wang et al., 2011) is used. The image data introduced in (Zuo et al., 2010) is also applied.
Dataset Splits | Yes
LLM Response: The parameters C and g are searched via five-fold cross-validation over {0.1, 1, 10, 50, 100} and {0.001, 0.01, 0.1, 1, 10}, respectively. The data subset corresponding to each cluster is randomly split into two equal parts: one part is used for training and the other for testing. The random split is repeated 10 times. (A sketch of this protocol follows.)
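A short sketch of the reported evaluation protocol, assuming scikit-learn: a 50/50 random train/test split of one cluster's subset, repeated 10 times. `train_and_score` is a hypothetical callback that fits a model and returns test accuracy.

```python
from sklearn.model_selection import train_test_split

def repeated_holdout(X_cluster, y_cluster, train_and_score, n_repeats=10):
    """Average test accuracy over n_repeats random 50/50 splits."""
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X_cluster, y_cluster, test_size=0.5, random_state=seed)
        scores.append(train_and_score(X_tr, y_tr, X_te, y_te))
    return sum(scores) / n_repeats  # mean over the 10 random splits
```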
Hardware Specification | No
LLM Response: The paper does not explicitly describe the hardware used to run its experiments, such as specific GPU or CPU models.
Software Dependencies | No
LLM Response: The paper mentions using SVM, Random Forest (RF), K-means, and Gaussian Mixture Models (GMM), but it does not provide version numbers for any software dependencies, libraries, or programming environments.
Experiment Setup | Yes
LLM Response: The radial basis kernel is chosen for both SVM and LQW. The parameters C and g are searched via five-fold cross-validation over {0.1, 1, 10, 50, 100} and {0.001, 0.01, 0.1, 1, 10}, respectively. For RF, only the number of trees is varied over {10, 50, 100, 200, 300}; the other parameters are left at their defaults. The parameter γ in LQHC and LQSC is searched over {0.0001, 0.001, 0.01, 0.1, 1}. The maximum number of iterations in LQSC is set to 20. In both LQHC and LQSC, the number of clusters M is set to 3. K is set to 50 (for image features). (A grid-search sketch follows.)
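A hedged sketch of the reported hyperparameter search, assuming scikit-learn: C and g (gamma) are selected by five-fold cross-validation over the grids quoted above, and the RF tree count is searched the same way. Only the grids stated in the paper are used; everything else is a plain default.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# RBF-kernel SVM: search C and gamma over the paper's grids with 5-fold CV.
svm_search = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 50, 100],
                "gamma": [0.001, 0.01, 0.1, 1, 10]},
    cv=5)

# Random Forest: only the number of trees is varied; other parameters default.
rf_search = GridSearchCV(
    RandomForestClassifier(),
    param_grid={"n_estimators": [10, 50, 100, 200, 300]},
    cv=5)

# Usage: svm_search.fit(X_train, y_train); svm_search.best_params_
```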