FineCLIP: Self-distilled Region-based CLIP for Better Fine-grained Understanding
Authors: Dong Jing, Xiaolong He, Yutian Luo, Nanyi Fei, Guoxing Yang, Wei Wei, Huiwen Zhao, Zhiwu Lu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on challenging dense prediction and image-level tasks. [...] Through extensive experimental evaluations, we show that FineCLIP surpasses previous arts on most dense prediction tasks and image-level tasks under fair comparison settings, demonstrating its effectiveness in both fine-grained understanding and semantic-aligned global representation. |
| Researcher Affiliation | Collaboration | (1) Gaoling School of Artificial Intelligence, Renmin University of China; (2) Meta Brain AGI Lab, Shanghai, China; (3) R&D Management Department, Honor Device Co., Ltd. {jingdong98, xiaolonghe, luzhiwu}@ruc.edu.cn |
| Pseudocode | No | The paper does not contain any blocks explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | We will release the code and generated textual descriptions of regions soon. |
| Open Datasets | Yes | we train FineCLIP using 8 A800 GPUs on train2017 split of COCO dataset [30], which includes approximately 118K human-annotated image-text pairs along with 970K region-label pairs. |
| Dataset Splits | Yes | Using the COCO val2017 split, we test FineCLIP designs on the box classification task with pooled region features and image-level retrieval tasks using global embeddings. |
| Hardware Specification | Yes | we train FineCLIP using 8 A800 GPUs on train2017 split of COCO dataset |
| Software Dependencies | No | The paper lists various software components and models (e.g., BERT, ViT, AdamW, BLIP-2, YOLOv9, PyTorch) but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We train FineCLIP for 10 epochs using AdamW [32] optimizer with the batch size of 32 per GPU, the learning rate of 1e-5, and the weight decay of 0.1. The coefficients λ and γ in the learning objective are both set to 1. In all experiments, we freeze the language encoder L to reduce computational overheads and improve training stability. |
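
The Experiment Setup row maps onto a standard PyTorch training configuration. The sketch below is a minimal illustration, not the authors' released code (which the paper only promises to release later): the only values taken from the paper are the quoted hyperparameters and the frozen language encoder; the model structure, the `language_encoder` attribute name, and the exact decomposition of the objective into global, region, and distillation terms are assumptions.

```python
import torch
from torch.optim import AdamW

# Hyperparameters quoted in the Experiment Setup row.
EPOCHS = 10
BATCH_SIZE_PER_GPU = 32     # trained on 8 A800 GPUs
LR = 1e-5
WEIGHT_DECAY = 0.1
LAMBDA_COEF = 1.0           # λ in the learning objective
GAMMA_COEF = 1.0            # γ in the learning objective


def build_optimizer(model: torch.nn.Module) -> AdamW:
    """Freeze the language encoder and optimize the remaining parameters with AdamW."""
    # The paper freezes the language encoder L; `model.language_encoder` is an
    # assumed attribute name for a CLIP-style dual-encoder model.
    for p in model.language_encoder.parameters():
        p.requires_grad = False
    trainable = [p for p in model.parameters() if p.requires_grad]
    return AdamW(trainable, lr=LR, weight_decay=WEIGHT_DECAY)


def total_loss(global_loss: torch.Tensor,
               region_loss: torch.Tensor,
               distill_loss: torch.Tensor) -> torch.Tensor:
    # Weighted sum with λ = γ = 1; which terms λ and γ scale is an assumption,
    # since the quoted text does not spell out the objective's decomposition.
    return global_loss + LAMBDA_COEF * region_loss + GAMMA_COEF * distill_loss
```

Freezing the language encoder before building the optimizer mirrors the paper's stated motivation of reducing computational overhead and improving training stability.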