Semantic-Enhanced Image Clustering
Authors: Shaotian Cai, Liping Qiu, Xiaojun Chen, Qin Zhang, Longteng Chen
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on five benchmark datasets clearly show the superiority of our new method. In this section, we conduct experiments on various public benchmark datasets to evaluate our proposed method. |
| Researcher Affiliation | Academia | Shenzhen University, Shenzhen, China; cai.st@foxmail.com, qiuliping2021@email.szu.edu.cn, {xjchen, qinzhang}@szu.edu.cn, chenlt2021@163.com |
| Pseudocode | Yes | Algorithm 1: Semantic-Enhanced Image Clustering. Input: images set D, nouns set W, neural networks g(.), h(.) and f(.; ϕ), training epoch T, cluster number c, hyperparameters γu and γr, threshold κ, nearest neighborhoods number k, trade-off parameters λ and β. Output: Cluster assignments Y. Update U = g(D) and V = h(T). Filter W to obtain the semantic set T and embeddings V via Semantic Space Construction. Initialize ϕ^(0) and P^(0). for t = 0 to T do: update Q^(t+1) = f(g(D); ϕ^(t)); generate c representative semantic centers H from U, V and Q^(t+1); update pseudo-labels P^(t+1) via Eq. (5); update ϕ^(t+1) by optimizing Eq. (11); end. Output cluster assignments Y by y_i = one-hot(argmax_j q_ij^(T+1)). |
| Open Source Code | No | The paper does not provide an explicit statement of code release or a link to a repository for the methodology described in this paper. |
| Open Datasets | Yes | Datasets. We evaluated our method on five benchmark datasets, i.e. Cifar10 (Krizhevsky 2009), Cifar100-20 (Krizhevsky 2009), STL10 (Coates, Ng, and Lee 2011), ImageNet-Dogs (Chang et al. 2017b) and Tiny-ImageNet (Le and Yang 2015). |
| Dataset Splits | Yes | Table 1: Characteristics of five benchmark datasets. As shown in Table 3, different from most prior methods that train and evaluate on the whole datasets (top part of the table), we train and evaluate SCAN, NNM and SIC using the train and val splits respectively, like SCAN (Van Gansbeke et al. 2020), which allows us to study the generalization properties of our method for novel unseen examples. |
| Hardware Specification | No | The paper mentions using specific models (ViT-32, Transformer) and libraries (Faiss) but does not provide details about the specific hardware (GPU, CPU models, etc.) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'Faiss Library' and specific models but does not provide specific version numbers for software dependencies like Python, PyTorch, or Faiss itself. |
| Experiment Setup | Yes | Implementation details. For representation learning, we used the CLIP pre-training model, whose visual and text backbones are ViT-32 (Dosovitskiy et al. 2020) and Transformer (Vaswani et al. 2017), separately. We obtained features from the image encoder of CLIP and then trained a cluster head. The cluster head is a fully connected layer with a size of d × c, where d and c are the pre-training feature dimension and the number of clusters, respectively. During the training, the epoch numbers of all datasets were set to 100 with a batch size of 128. Before training, all datasets were augmented with the same method used in CLIP (Radford et al. 2021), i.e., a random square crop from resized images. The nearest neighbors were searched through the Faiss library (Johnson, Douze, and Jégou 2021). The best hyperparameters used for five benchmark datasets are shown in Table 2. |
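
The Algorithm 1 text quoted in the Pseudocode row is a flat PDF extraction, so the loop structure is easier to see in code. Below is a minimal structural sketch of that loop in PyTorch, assuming frozen CLIP embeddings U (images) and V (nouns). The semantic-center construction, the pseudo-label rule of Eq. (5), and the loss of Eq. (11) are replaced by simple stand-ins (nearest-noun centers and cross-entropy against pseudo-labels), since the paper's exact formulas are not reproduced in this report; all function and variable names here are assumptions, not the authors' code.

```python
# Structural sketch of Algorithm 1's training loop. Stand-ins are used where the
# paper's exact formulas (semantic centers H, Eq. (5), Eq. (11)) are not quoted above.
import torch
import torch.nn.functional as F

def train_cluster_head(U, V, cluster_head, epochs=100, lr=1e-3):
    """U: (n, d) frozen image embeddings; V: (m, d) noun/semantic embeddings."""
    opt = torch.optim.Adam(cluster_head.parameters(), lr=lr)
    for t in range(epochs):
        Q = F.softmax(cluster_head(U), dim=1)                 # soft assignments Q^(t+1)
        centers = Q.t() @ U / Q.sum(dim=0, keepdim=True).t()  # cluster means in image space
        # Stand-in for "generate c representative semantic centers H from U, V and Q":
        # for each cluster, pick the noun embedding closest to its mean image embedding.
        sims = F.normalize(centers, dim=1) @ F.normalize(V, dim=1).t()
        H = V[sims.argmax(dim=1)]                             # (c, d) semantic centers
        # Stand-in for Eq. (5): pseudo-label each image by its nearest semantic center.
        P = (F.normalize(U, dim=1) @ F.normalize(H, dim=1).t()).argmax(dim=1)
        # Stand-in for Eq. (11): cross-entropy between cluster-head logits and pseudo-labels.
        loss = F.cross_entropy(cluster_head(U), P)
        opt.zero_grad(); loss.backward(); opt.step()
    # y_i = one-hot(argmax_j q_ij) after the final epoch.
    return cluster_head(U).argmax(dim=1)
```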
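Similarly, the Experiment Setup row (frozen CLIP image features, a fully connected cluster head of size d × c, and Faiss nearest-neighbor search) can be approximated with off-the-shelf packages. The sketch below uses OpenAI's `clip` package and `faiss`; mapping the paper's "ViT-32" to the `"ViT-B/32"` CLIP checkpoint, as well as the cluster count, the placeholder batch, and the value of k, are assumptions, since the paper's Table 2 hyperparameters are not reproduced here.

```python
# Sketch of the reported setup: CLIP ViT-B/32 image features, a linear cluster
# head of size d x c, and Faiss k-nearest-neighbor search over the features.
# Placeholder values are marked; this is not the authors' implementation.
import torch
import clip    # pip install git+https://github.com/openai/CLIP.git
import faiss

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)     # assumed "ViT-32" backbone

c = 10                                  # number of clusters (e.g. 10 for Cifar10)
d = model.visual.output_dim             # pre-training feature dimension (512 here)
cluster_head = torch.nn.Linear(d, c).to(device)              # fully connected layer, d x c

images = torch.randn(128, 3, 224, 224, device=device)        # placeholder batch (batch size 128)
with torch.no_grad():
    features = model.encode_image(images).float()            # frozen CLIP image features
logits = cluster_head(features)                               # cluster-head output

# k-nearest-neighbor search over normalized features with Faiss.
k = 20                                                        # placeholder; see the paper's Table 2
feats = torch.nn.functional.normalize(features, dim=1).cpu().numpy().astype("float32")
index = faiss.IndexFlatIP(d)                                  # inner product == cosine on unit vectors
index.add(feats)
_, neighbors = index.search(feats, k + 1)                     # column 0 is each sample itself
```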