FineCLIP: Self-distilled Region-based CLIP for Better Fine-grained Understanding

Authors: Dong Jing, Xiaolong He, Yutian Luo, Nanyi Fei, Guoxing Yang, Wei Wei, Huiwen Zhao, Zhiwu Lu

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments on challenging dense prediction and image-level tasks. [...] Through extensive experimental evaluations, we show that FineCLIP surpasses previous arts on most dense prediction tasks and image-level tasks under fair comparison settings, demonstrating its effectiveness in both fine-grained understanding and semantic-aligned global representation.
Researcher Affiliation | Collaboration | ¹Gaoling School of Artificial Intelligence, Renmin University of China; ²Meta Brain AGI Lab, Shanghai, China; ³R&D Management Department, Honor Device Co., Ltd; {jingdong98, xiaolonghe, luzhiwu}@ruc.edu.cn
Pseudocode | No | The paper does not contain any blocks explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | No | We will release the code and generated textual descriptions of regions soon.
Open Datasets | Yes | we train FineCLIP using 8 A800 GPUs on train2017 split of COCO dataset [30], which includes approximately 118K human-annotated image-text pairs along with 970K region-label pairs.
Dataset Splits | Yes | Using the COCO val2017 split, we test FineCLIP designs on the box classification task with pooled region features and image-level retrieval tasks using global embeddings.
Hardware Specification | Yes | we train FineCLIP using 8 A800 GPUs on train2017 split of COCO dataset
Software Dependencies | No | The paper lists various software components and models (e.g., BERT, ViT, AdamW, BLIP-2, YOLOv9, PyTorch) but does not provide specific version numbers for any of them.
Experiment Setup | Yes | We train FineCLIP for 10 epochs using AdamW [32] optimizer with the batch size of 32 per GPU, the learning rate of 1e-5, and the weight decay of 0.1. The coefficients λ and γ in learning objective are both set to 1. In all experiments, we freeze the language encoder L to reduce computational overheads and improve training stability.
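For context on the Open Datasets row, the sketch below shows one way the quoted COCO train2017 image-text and region-label pairs could be assembled from the standard COCO annotation files. The file names, the `Sample` container, and the `load_coco_pairs` helper are illustrative assumptions, not the authors' data pipeline; only the dataset split and the pairing of captions with box-label annotations come from the quote.

```python
import json
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Sample:
    image_path: str
    caption: str                             # image-level text pair
    regions: List[Tuple[List[float], str]]   # (xyxy box, label) region-label pairs

def load_coco_pairs(captions_json: str, instances_json: str, image_dir: str) -> List[Sample]:
    """Hypothetical loader pairing COCO train2017 captions with region-label annotations."""
    caps = json.load(open(captions_json))
    inst = json.load(open(instances_json))

    id2name = {c["id"]: c["name"] for c in inst["categories"]}
    id2file = {im["id"]: f'{image_dir}/{im["file_name"]}' for im in inst["images"]}

    # Keep one caption per image for illustration; COCO provides several.
    id2caption = {a["image_id"]: a["caption"] for a in caps["annotations"]}

    id2regions: dict = {}
    for a in inst["annotations"]:
        x, y, w, h = a["bbox"]  # COCO boxes are xywh; convert to xyxy
        id2regions.setdefault(a["image_id"], []).append(
            ([x, y, x + w, y + h], id2name[a["category_id"]])
        )

    return [
        Sample(id2file[i], id2caption.get(i, ""), id2regions.get(i, []))
        for i in id2file
        if i in id2caption or i in id2regions
    ]
```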
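For the Dataset Splits row, the box classification evaluation on val2017 amounts to matching pooled region features against class-name text embeddings. The sketch below is a minimal zero-shot matching routine under that assumption; the feature tensors are stand-ins for FineCLIP encoder outputs, not the authors' evaluation code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def classify_boxes(region_feats: torch.Tensor, class_text_feats: torch.Tensor) -> torch.Tensor:
    """Zero-shot box classification sketch: each pooled region feature (N_boxes, D)
    is assigned the class whose text embedding (N_classes, D) it matches best
    under cosine similarity."""
    region_feats = F.normalize(region_feats, dim=-1)
    class_text_feats = F.normalize(class_text_feats, dim=-1)
    logits = region_feats @ class_text_feats.t()   # (N_boxes, N_classes)
    return logits.argmax(dim=-1)                   # predicted class index per box

# Example with random features standing in for FineCLIP outputs:
# preds = classify_boxes(torch.randn(5, 512), torch.randn(80, 512))
```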
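The Experiment Setup row maps to a small PyTorch configuration sketch, shown below. Only the optimizer, epoch count, per-GPU batch size, learning rate, weight decay, the λ = γ = 1 coefficients, and the frozen language encoder come from the quoted text; the `visual_encoder`/`language_encoder` names and the composition of the loss terms are assumptions for illustration.

```python
import torch

# Hyperparameters quoted in the table above.
EPOCHS = 10
BATCH_SIZE_PER_GPU = 32
LR = 1e-5
WEIGHT_DECAY = 0.1
LAMBDA, GAMMA = 1.0, 1.0  # coefficients in the learning objective, both set to 1

def build_optimizer(visual_encoder: torch.nn.Module, language_encoder: torch.nn.Module):
    # Freeze the language encoder L to reduce computational overhead and
    # improve training stability, as stated in the quoted setup.
    for p in language_encoder.parameters():
        p.requires_grad = False
    trainable = [p for p in visual_encoder.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=LR, weight_decay=WEIGHT_DECAY)

def total_loss(global_loss: torch.Tensor,
               region_loss: torch.Tensor,
               distill_loss: torch.Tensor) -> torch.Tensor:
    # Hypothetical composition of the objective: the quote only states that
    # lambda and gamma weight terms of the objective and are both set to 1.
    return global_loss + LAMBDA * region_loss + GAMMA * distill_loss
```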