Decorrelated Clustering with Data Selection Bias

Authors: Xiao Wang, Shaohua Fan, Kun Kuang, Chuan Shi, Jiawei Liu, Bai Wang

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experimental results on real-world datasets demonstrate that our DCKM algorithm achieves significant performance gains, indicating the necessity of removing the unexpected feature correlations induced by selection bias when clustering."
Researcher Affiliation | Academia | (1) Beijing University of Posts and Telecommunications, (2) Zhejiang University. Emails: xiaowang@bupt.edu.cn, fanshaohua92@163.com, kunkuang@zju.edu.cn, {shichuan, liu jiawei, wangbai}@bupt.edu.cn
Pseudocode | No | The paper describes the optimization process and the parameter-updating rules in text and mathematical equations, but it does not include a clearly labeled "Pseudocode" or "Algorithm" block with structured steps.
Open Source Code | No | The paper does not provide any links to, or explicit statements about the availability of, its source code.
Open Datasets | Yes | Office-Caltech dataset [Gong et al., 2012]: a collection of images from four domains (DSLR, Amazon, Webcam, Caltech), each with on average almost a thousand labeled images across 10 categories. It has been widely used in transfer learning [Long et al., 2014] because of the biases created by the different data-collection processes. SURF [Bay et al., 2006] and Bag-of-Words are used as image features, with dimension 500. Office-Home dataset [Venkateswara et al., 2017]: an object-recognition dataset containing hundreds of object categories typically found in Office and Home settings.
Dataset Splits | No | The paper names the datasets used but does not provide specific training, validation, or test splits (e.g., percentages, sample counts, or citations to predefined splits).
Hardware Specification | No | The paper does not provide any details about the hardware used to run the experiments (e.g., GPU/CPU models, memory, or cloud instances).
Software Dependencies | No | The paper mentions using SURF and Bag-of-Words image features, but it does not name any software with version numbers (e.g., Python, PyTorch, scikit-learn).
Experiment Setup | Yes | Parameter settings and metrics: for DCKM, λ3 is fixed to 1 and λ1, λ2 are selected from {10^-2, 10^-1, 1, 10, 10^2, 10^3}. For Drop+KM, the high-correlation feature threshold is set to 0.7. For PCA+KM, following [Ding and He, 2004], the reduced dimension is set to K-1, where K is the number of clusters. Because all the unsupervised feature-selection methods are relatively sensitive to the number of selected features, that number is grid-searched over {50, 100, ..., 450}. For all methods, the number of clusters K is set to the number of classes in each sub-dataset. Since all the clustering algorithms depend on initialization, every method is repeated 20 times with random initializations and the average performance is reported.
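Since no code is released (see "Open Source Code" above), the evaluation protocol in this row can only be sketched. Below is a minimal, hypothetical Python sketch of the two simple baselines (Drop+KM with a 0.7 correlation threshold, PCA+KM with K-1 components) and the 20-run random-initialization averaging; it assumes scikit-learn and a generic feature matrix `X`, and it does not reproduce DCKM itself. The helper names (`drop_km`, `pca_km`, `repeated_km`) are invented for illustration.

```python
# Hedged sketch of the reported evaluation protocol, NOT the authors' code.
# DCKM itself is not reproduced; only the Drop+KM / PCA+KM baselines and the
# "repeat 20 times with random init, report the average" loop are illustrated.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import normalized_mutual_info_score

def repeated_km(X, K, n_runs=20, seed=0):
    """Run k-means n_runs times with different random initializations
    and return one label vector per run."""
    return [KMeans(n_clusters=K, n_init=1, random_state=seed + r).fit_predict(X)
            for r in range(n_runs)]

def drop_km(X, K, threshold=0.7, n_runs=20, seed=0):
    """Drop+KM: drop one feature of every pair whose absolute Pearson
    correlation exceeds `threshold`, then run k-means on the rest."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    upper = np.triu(corr, k=1)  # only compare each feature to earlier ones
    keep = [j for j in range(X.shape[1]) if not np.any(upper[:, j] > threshold)]
    return repeated_km(X[:, keep], K, n_runs, seed)

def pca_km(X, K, n_runs=20, seed=0):
    """PCA+KM: reduce to K-1 dimensions (following Ding & He, 2004),
    then run k-means."""
    Z = PCA(n_components=K - 1, random_state=seed).fit_transform(X)
    return repeated_km(Z, K, n_runs, seed)

# Toy usage on synthetic data; y_true exists only to score the runs.
rng = np.random.default_rng(0)
K = 3
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 10)) for c in range(K)])
y_true = np.repeat(np.arange(K), 50)

runs = pca_km(X, K)  # 20 label vectors
avg_nmi = np.mean([normalized_mutual_info_score(y_true, labels)
                   for labels in runs])
```

The averaging over 20 random initializations mirrors the paper's reporting protocol; metrics and the grid search over λ1, λ2 (and over the number of selected features for the feature-selection baselines) would wrap this loop.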