Decorrelated Clustering with Data Selection Bias

Authors: Xiao Wang, Shaohua Fan, Kun Kuang, Chuan Shi, Jiawei Liu, Bai Wang

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experimental results on real-world datasets demonstrate that our DCKM algorithm achieves significant performance gains, indicating the necessity of removing the unexpected feature correlations induced by selection bias when clustering."
Researcher Affiliation | Academia | (1) Beijing University of Posts and Telecommunications, (2) Zhejiang University. Emails: xiaowang@bupt.edu.cn, fanshaohua92@163.com, kunkuang@zju.edu.cn, {shichuan, liu jiawei, wangbai}@bupt.edu.cn
Pseudocode | No | The paper describes the optimization process and the parameter-updating rules in text and mathematical equations, but it does not include a clearly labeled "Pseudocode" or "Algorithm" block with structured steps.
Open Source Code | No | The paper does not provide any links to, or explicit statements about the availability of, its source code.
Open Datasets | Yes | Office-Caltech dataset [Gong et al., 2012]: a collection of images from four domains (DSLR, Amazon, Webcam, Caltech), each with on average almost a thousand labeled images across 10 categories. It has been widely used in transfer learning [Long et al., 2014] because of the biases created by the different data-collection processes. SURF [Bay et al., 2006] and Bag-of-Words are used as image features, with dimension 500. Office-Home dataset [Venkateswara et al., 2017]: an object-recognition dataset containing hundreds of object categories typically found in Office and Home settings.
Dataset Splits | No | The paper names the datasets used but does not provide specific training, validation, or test splits (e.g., percentages, sample counts, or citations to predefined splits).
Hardware Specification | No | The paper does not provide any details about the hardware used to run the experiments (e.g., GPU/CPU models, memory, or cloud instances).
Software Dependencies | No | The paper mentions using SURF and Bag-of-Words image features, but it does not name any software with version numbers (e.g., Python, PyTorch, scikit-learn).
Experiment Setup | Yes | Parameter settings and metrics: for DCKM, λ3 is fixed to 1 and λ1, λ2 are selected from {10^-2, 10^-1, 1, 10, 10^2, 10^3}. For Drop+KM, the high-correlation feature threshold is set to 0.7. For PCA+KM, following [Ding and He, 2004], the reduced dimension is set to K-1, where K is the number of clusters. Because all the unsupervised feature-selection methods are relatively sensitive to the number of selected features, that number is grid-searched over {50, 100, ..., 450}. For all methods, the number of clusters K is set to the number of classes in each sub-dataset. Since all the clustering algorithms depend on initialization, every method is repeated 20 times with random initializations and the average performance is reported.
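Since no code is released (see "Open Source Code" above), the evaluation protocol in this row can only be sketched. Below is a minimal, hypothetical Python sketch of the two simple baselines (Drop+KM with a 0.7 correlation threshold, PCA+KM with K-1 components) and the 20-run random-initialization averaging; it assumes scikit-learn and a generic feature matrix `X`, and it does not reproduce DCKM itself. The helper names (`drop_km`, `pca_km`, `repeated_km`) are invented for illustration.

```python
# Hedged sketch of the reported evaluation protocol, NOT the authors' code.
# DCKM itself is not reproduced; only the Drop+KM / PCA+KM baselines and the
# "repeat 20 times with random init, report the average" loop are illustrated.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import normalized_mutual_info_score

def repeated_km(X, K, n_runs=20, seed=0):
    """Run k-means n_runs times with different random initializations
    and return one label vector per run."""
    return [KMeans(n_clusters=K, n_init=1, random_state=seed + r).fit_predict(X)
            for r in range(n_runs)]

def drop_km(X, K, threshold=0.7, n_runs=20, seed=0):
    """Drop+KM: drop one feature of every pair whose absolute Pearson
    correlation exceeds `threshold`, then run k-means on the rest."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    upper = np.triu(corr, k=1)  # only compare each feature to earlier ones
    keep = [j for j in range(X.shape[1]) if not np.any(upper[:, j] > threshold)]
    return repeated_km(X[:, keep], K, n_runs, seed)

def pca_km(X, K, n_runs=20, seed=0):
    """PCA+KM: reduce to K-1 dimensions (following Ding & He, 2004),
    then run k-means."""
    Z = PCA(n_components=K - 1, random_state=seed).fit_transform(X)
    return repeated_km(Z, K, n_runs, seed)

# Toy usage on synthetic data; y_true exists only to score the runs.
rng = np.random.default_rng(0)
K = 3
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 10)) for c in range(K)])
y_true = np.repeat(np.arange(K), 50)

runs = pca_km(X, K)  # 20 label vectors
avg_nmi = np.mean([normalized_mutual_info_score(y_true, labels)
                   for labels in runs])
```

The averaging over 20 random initializations mirrors the paper's reporting protocol; metrics and the grid search over λ1, λ2 (and over the number of selected features for the feature-selection baselines) would wrap this loop.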