SceneDiff: Generative Scene-Level Image Retrieval with Text and Sketch Using Diffusion Models

Authors: Ran Zuo, Haoxiang Hu, Xiaoming Deng, Cangjun Gao, Zhengming Zhang, Yu-Kun Lai, Cuixia Ma, Yong-Jin Liu, Hongan Wang

IJCAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our method outperforms the state-of-the-art works through extensive experiments, providing a novel insight into the related retrieval field.
Researcher Affiliation | Academia | Ran Zuo1,2, Haoxiang Hu1,2, Xiaoming Deng1,2, Cangjun Gao1,2, Zhengming Zhang1,2, Yu-Kun Lai3, Cuixia Ma1,2,4, Yong-Jin Liu5, Hongan Wang1,2; 1 Beijing Key Laboratory of Human-Computer Interaction, Institute of Software, Chinese Academy of Sciences; 2 University of Chinese Academy of Sciences; 3 Cardiff University; 4 Key Laboratory of System Software and State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences; 5 Tsinghua University
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. It provides mathematical equations and descriptive text for its methods.
Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a link to a code repository.
Open Datasets | Yes | (1) Sketchy COCO [Gao et al., 2020] [...] to select 1,015 pairs for training and 210 for testing. (2) FS-COCO [Chowdhury et al., 2022] [...] which includes 7,000/3,000 train/test pairs. (3) SFSD [Zhang et al., 2023b] [...] We divide the dataset into 8,480/3,635 train/test pairs.
Dataset Splits | Yes | (1) Sketchy COCO [Gao et al., 2020] [...] to select 1,015 pairs for training and 210 for testing. (2) FS-COCO [Chowdhury et al., 2022] [...] which includes 7,000/3,000 train/test pairs. (3) SFSD [Zhang et al., 2023b] [...] We divide the dataset into 8,480/3,635 train/test pairs.
Hardware Specification | Yes | All experiments are conducted on one NVIDIA A100 80G GPU with learning rate 1e-6 and batch size 4.
Software Dependencies | Yes | Then we construct the diffusion-based retrieval framework by utilizing the pre-trained SD model with version 1.4, along with its associated pre-trained autoencoder.
Experiment Setup | Yes | All experiments are conducted on one NVIDIA A100 80G GPU with learning rate 1e-6 and batch size 4. [...] The parameters are set as follows: the number of samplings n is 3, the number of sampling steps k is 2, λ1 and λ2 is 1 and 0.1 respectively.
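
The setup values and dataset splits reported above can be gathered in one place for anyone attempting a reproduction. The following is a minimal sketch, not the authors' code (the paper releases none): the class name, field names, and the `train_fraction` helper are hypothetical, and only the numeric values come from the quoted text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetup:
    """Hyperparameters as quoted above; all field names are hypothetical."""
    sd_version: str = "1.4"       # pre-trained SD model version
    learning_rate: float = 1e-6
    batch_size: int = 4
    num_samplings: int = 3        # n in the quoted text
    num_sampling_steps: int = 2   # k in the quoted text
    lambda1: float = 1.0
    lambda2: float = 0.1


# (train, test) pair counts as reported for each dataset.
DATASET_SPLITS = {
    "Sketchy COCO": (1015, 210),
    "FS-COCO": (7000, 3000),
    "SFSD": (8480, 3635),
}


def train_fraction(train: int, test: int) -> float:
    """Share of all pairs that falls in the training split."""
    return train / (train + test)


if __name__ == "__main__":
    print(ExperimentSetup())
    for name, (train, test) in DATASET_SPLITS.items():
        share = train_fraction(train, test)
        print(f"{name}: {train + test} pairs, {share:.1%} train")
```

Computing the training share makes the splits easier to compare: FS-COCO and SFSD are both close to a 70/30 split, while the Sketchy COCO subset is roughly 83/17.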