Ferret: Refer and Ground Anything Anywhere at Any Granularity

Authors: Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination. Code and data are available at https://github.com/apple/ml-ferret. Our model exhibits superior performance in a wide range of tasks and reduces object hallucination. We start with detailing the proposed hybrid region representation to depict regions of various shapes and formats. Then, we present the model architecture of Ferret. We start with evaluating Ferret on conventional referring and grounding benchmarks (Sec. 4.1 and 4.2). Then, we demonstrate the power of Ferret in more complex multimodal chatting with refer-and-ground capability in Sec. 4.3. For a detailed visualization of each, kindly check Appendix E. We further ablate key components in Ferret (Sec. 4.4), analyze the object hallucination of Ferret (Sec. 4.5), and discuss Ferret vs. GPT-4V (Sec. ??).
Researcher Affiliation | Collaboration | Haoxuan You1, Haotian Zhang2, Zhe Gan2, Xianzhi Du2, Bowen Zhang2, Zirui Wang2, Liangliang Cao2, Shih-Fu Chang1, Yinfei Yang2; 1Columbia University, 2Apple AI/ML
Pseudocode | No | The paper describes the steps of the spatial-aware visual sampler but does not present them in structured pseudocode or an algorithm block.
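Because the sampler is only described in prose, the sketch below gives one plausible PyTorch reading of it: points are drawn at random inside the region mask, their features are looked up by bilinear interpolation, and a PointNet++-style sample-group-pool block condenses them into region features. All shapes, layer choices, and helper names here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a spatial-aware visual sampler; not the released Ferret code.
import torch
import torch.nn.functional as F


def sample_region_points(mask: torch.Tensor, num_points: int = 512) -> torch.Tensor:
    """Randomly sample 2D coordinates (normalized to [-1, 1]) inside a binary region mask."""
    ys, xs = torch.nonzero(mask, as_tuple=True)            # pixels belonging to the region
    idx = torch.randint(0, ys.numel(), (num_points,))      # sample with replacement
    h, w = mask.shape
    # convert pixel indices to the normalized coordinates expected by grid_sample
    coords = torch.stack([xs[idx] / (w - 1), ys[idx] / (h - 1)], dim=-1) * 2 - 1
    return coords                                          # (num_points, 2)


def lookup_point_features(feat_map: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """Bilinearly interpolate an image feature map (C, H, W) at the sampled coordinates."""
    grid = coords.view(1, 1, -1, 2)                        # (1, 1, N, 2)
    sampled = F.grid_sample(feat_map.unsqueeze(0), grid, align_corners=True)
    return sampled.squeeze(0).squeeze(1).t()               # (N, C)


def sample_group_pool(coords: torch.Tensor, feats: torch.Tensor,
                      num_centers: int, k: int):
    """One sampling-grouping-pooling block: pick center points, gather the k nearest
    neighbors of each center, and max-pool neighbor features into the center."""
    # (Assumption) plain random center selection stands in for farthest point sampling.
    center_idx = torch.randperm(coords.size(0))[:num_centers]
    centers = coords[center_idx]                           # (M, 2)
    dists = torch.cdist(centers, coords)                   # (M, N)
    knn_idx = dists.topk(k, largest=False).indices         # (M, k)
    grouped = feats[knn_idx]                               # (M, k, C)
    pooled = grouped.max(dim=1).values                     # (M, C)
    return centers, pooled


if __name__ == "__main__":
    feat_map = torch.randn(1024, 24, 24)                   # CLIP-like feature map (C, H, W)
    mask = torch.zeros(336, 336)
    mask[100:200, 120:260] = 1                             # free-form region as a binary mask
    coords = sample_region_points(mask, num_points=512)
    feats = lookup_point_features(feat_map, coords)
    centers, region_feats = sample_group_pool(coords, feats, num_centers=128, k=16)
    print(region_feats.shape)                              # torch.Size([128, 1024])
```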
Open Source Code | Yes | Code and data are available at https://github.com/apple/ml-ferret.
Open Datasets | Yes | The majority of the dataset is converted from existing vision(-language) tasks like object detection (Krishna et al., 2017) and phrase grounding (Yu et al., 2016; Plummer et al., 2015) with carefully designed templates to make it instruction-following. To achieve visual understanding at the object level, we select object detection datasets such as Visual Genome (Krishna et al., 2017), Object365 (Shao et al., 2019), and visual grounding datasets including RefCOCOs (Yu et al., 2016; Lin et al., 2014; Nagaraja et al., 2016) and Flickr30k-Entities (Plummer et al., 2015). Additionally, to exploit existing instruction-tuning data such as those in LLaVA (Liu et al., 2023b), we apply an open-vocabulary object detector, GLIPv2 (Zhang et al., 2022), on LLaVA-158k data to localize groundable nouns in the text.
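The template-based conversion described above can be illustrated with a small example: one detection or grounding record is rewritten into a chat-style instruction sample whose text embeds the region coordinates. The template wording and JSON layout below are assumptions for illustration only; the released data at https://github.com/apple/ml-ferret defines the actual format.

```python
# Hypothetical conversion of a grounding annotation into an instruction-following sample.
import json
import random

REFER_TEMPLATES = [
    "What is the object in the region {box}?",
    "Describe the area {box} of the image.",
]

def box_to_text(box):
    """Serialize a box as discrete pixel coordinates: [x1, y1, x2, y2]."""
    x1, y1, x2, y2 = box
    return f"[{round(x1)}, {round(y1)}, {round(x2)}, {round(y2)}]"

def detection_to_instruction(sample):
    """Turn one {image, box, label} detection record into a chat-style training example."""
    box_text = box_to_text(sample["box"])
    return {
        "image": sample["image"],
        "conversations": [
            {"from": "human", "value": random.choice(REFER_TEMPLATES).format(box=box_text)},
            {"from": "gpt", "value": f"The region {box_text} contains a {sample['label']}."},
        ],
    }

if __name__ == "__main__":
    record = {"image": "vg/2331021.jpg",
              "box": [112.0, 60.5, 300.0, 310.0],
              "label": "golden retriever"}
    print(json.dumps(detection_to_instruction(record), indent=2))
```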
Dataset Splits | No | The paper refers to using existing datasets like Visual Genome, Object365, RefCOCOs, Flickr30k-Entities, and LLaVA-158k, converting them into an instruction-following format to create GRIT (1.1M samples). It mentions using the validation split of LVIS for evaluation, but it does not provide explicit training/validation/test splits for the composite GRIT dataset that would be needed to reproduce the training setup.
Hardware Specification | Yes | The training takes 5/2.5 days on 8 A100 GPUs for Ferret-13B/7B, respectively.
Software Dependencies | No | The paper mentions software components such as CLIP-ViT-L/14, Vicuna, LLaMA, LLaVA, SAM, and GLIPv2, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Ferret is trained on the aforementioned GRIT data for three epochs, optimized with AdamW (Loshchilov & Hutter, 2017) with a learning rate of 2e-5 and a batch size of 128.
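The reported hyperparameters map directly onto a standard PyTorch training configuration; the minimal sketch below wires them up. Only AdamW, the 2e-5 learning rate, the batch size of 128, and the three epochs come from the paper; the model, dataset, and cosine schedule are placeholders and assumptions.

```python
# Minimal sketch of the reported optimization setup; details beyond the named
# hyperparameters are assumptions, not the paper's training script.
import torch
from torch.utils.data import DataLoader

def build_training(model: torch.nn.Module, train_dataset, epochs: int = 3):
    loader = DataLoader(train_dataset, batch_size=128, shuffle=True)   # effective batch size 128
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)         # AdamW (Loshchilov & Hutter, 2017)
    total_steps = epochs * len(loader)
    # (Assumption) a cosine schedule is common for instruction tuning; the paper does not specify one here.
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    return loader, optimizer, scheduler
```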