Image Clustering Conditioned on Text Criteria

Authors: Sehyun Kwon, Jaeseung Park, Minkyu Kim, Jaewoong Cho, Ernest K. Ryu, Kangwook Lee

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 4 EXPERIMENTS. We now present experimental results demonstrating the effectiveness of IC|TC.
Researcher Affiliation | Collaboration | Sehyun Kwon, Jaeseung Park, Minkyu Kim, Jaewoong Cho, Ernest K. Ryu, Kangwook Lee. Seoul National University, KRAFTON, University of Wisconsin-Madison. Co-senior authors.
Pseudocode | Yes | IC|TC: IMAGE CLUSTERING CONDITIONED ON TEXT CRITERIA. Our main method consists of 3 stages with an optional iterative outer loop. ... Step 1: Vision-language model (VLM) extracts salient features ... Step 2: Large Language Model (LLM) obtains K cluster names ... Step 3: Large Language Model (LLM) assigns clusters to images ... Main method IC|TC
Open Source Code | Yes | Our code is available at https://github.com/sehyunkwon/ICTC.
Open Datasets | Yes | We use the Stanford 40 Action Dataset (Yao et al., 2011) ... We use the People Playing Musical Instrument (PPMI) dataset (Wang et al., 2010; Yao and Fei-Fei, 2010) ... We compare IC|TC against several classical clustering algorithms on CIFAR-10, STL-10, and CIFAR-100.
Dataset Splits | No | The paper uses several standard datasets (e.g., CIFAR-10, STL-10, CIFAR-100) but does not give specific percentages, sample counts, or clear references to the exact train/validation/test splits used.
Hardware Specification | No | The paper mentions using LLaVA and GPT-4, with GPT-4 accessed through its API, but does not give specific details about the hardware (e.g., GPU/CPU models, memory) used to run the experiments or train the models.
Software Dependencies | Yes | In our experiments, we mainly use LLaVA (Liu et al., 2023) for the VLM and GPT-4 (OpenAI, 2023) for the LLM ... Table 11: Model versions for the VLMs and LLMs (e.g., blip2-flan-t5-xxl, llava-v1-0719-336px-lora-merge-vicuna-13b-v1.3, api-version=2023-03-15-preview)
Experiment Setup | Yes | In particular, the precise text prompts used can be found in Appendix B.3.1. ... Careful prompt engineering of P_step2b(TC, N, K) allows the user to refine the clusters to be consistent with the user's criteria. ... we concluded that using threshold values such as 5 or 10 was helpful in getting a better set of clustered classes.
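
The three stages quoted in the Pseudocode row can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation (see the linked repository for that). The function name `ictc_sketch`, the four model callables, and the `min_count` parameter are hypothetical. The raw-label step ("Step 2a") and the idea that the threshold filters raw labels by frequency are assumptions; the excerpts above only mention P_step2b and "threshold values such as 5 or 10" without saying what the threshold applies to. The stub functions stand in for real VLM and LLM calls.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence


def ictc_sketch(
    images: Sequence[str],
    criterion: str,
    k: int,
    describe: Callable[[str, str], str],
    raw_label: Callable[[str, str], str],
    name_clusters: Callable[[List[str], str, int], List[str]],
    assign: Callable[[str, List[str], str], str],
    min_count: int = 1,
) -> Dict[str, str]:
    """Return a mapping image -> cluster name (hypothetical outline)."""
    # Step 1: a VLM describes each image with respect to the text criterion.
    descriptions = {img: describe(img, criterion) for img in images}

    # Assumed Step 2a: an LLM proposes one raw label per description.
    counts = Counter(raw_label(d, criterion) for d in descriptions.values())

    # Assumed use of the threshold: drop raw labels seen fewer than
    # min_count times; fall back to all labels if nothing survives.
    frequent = [label for label, c in counts.items() if c >= min_count]
    candidates = frequent or list(counts)

    # Step 2: an LLM condenses the candidate labels into K cluster names.
    names = name_clusters(candidates, criterion, k)

    # Step 3: an LLM assigns each image (via its description) to one name.
    return {img: assign(d, names, criterion) for img, d in descriptions.items()}


# Stub models so the sketch runs without any API access (purely illustrative).
def stub_describe(image: str, criterion: str) -> str:
    return "a person " + image.split("_")[0]


def stub_raw_label(description: str, criterion: str) -> str:
    return description.split()[-1]


def stub_name_clusters(labels: List[str], criterion: str, k: int) -> List[str]:
    return sorted(labels)[:k]


def stub_assign(description: str, names: List[str], criterion: str) -> str:
    for name in names:
        if name in description:
            return name
    return names[0]  # arbitrary fallback when no name matches


demo = ictc_sketch(
    ["running_1", "running_2", "cooking_1", "cooking_2", "jumping_1"],
    criterion="action",
    k=2,
    describe=stub_describe,
    raw_label=stub_raw_label,
    name_clusters=stub_name_clusters,
    assign=stub_assign,
    min_count=2,
)
```

Passing the models in as callables keeps the outline independent of any particular VLM or LLM client; in practice each stub would be replaced by a prompted model call using the prompts the paper lists in Appendix B.3.1.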