FlexCap: Describe Anything in Images in Controllable Detail
Authors: Debidatta Dwibedi, Vidhi Jain, Jonathan J. Tompson, Andrew Zisserman, Yusuf Aytar
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate FlexCap's effectiveness in several applications: first, it achieves strong performance in dense captioning tasks on the Visual Genome dataset. Second, we show how FlexCap's localized descriptions can serve as input to a large language model to create a visual question answering (VQA) system, achieving state-of-the-art zero-shot performance on multiple VQA benchmarks. |
| Researcher Affiliation | Collaboration | Debidatta Dwibedi (Google DeepMind, debidatta@google.com); Vidhi Jain (Carnegie Mellon University, vidhij@andrew.cmu.edu); Jonathan Tompson (Google DeepMind, tompson@google.com); Andrew Zisserman (Google DeepMind, zisserman@google.com); Yusuf Aytar (Google DeepMind, yusufaytar@google.com) |
| Pseudocode | No | The paper describes methods and architectures but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Project webpage: https://flex-cap.github.io. We will release training code and model inference code. The base open-source dataset with which results can be replicated is YFCC100M. The object detector used is OWL-ViT v2. The vision backbone is SigLIP-pretrained SO400M. |
| Open Datasets | Yes | We demonstrate this at two scales: 200 million triplets using YFCC100M [47] captioned images; and 32 billion triplets using the WebLI [9] captioned dataset. ... As both OWL-ViT and the CLIP subset of YFCC100M are publicly available, the resulting localized captions dataset can be generated with open-source models and public datasets. |
| Dataset Splits | Yes | We use the train-test splits and evaluation metric as proposed in [26]. The paper proposes to use a mean of Average Precisions (mAP) over pairwise thresholds of both IoU thresholds (0.3, 0.4, 0.5, 0.6, 0.7) and METEOR score thresholds (0.0, 0.05, 0.1, 0.15, 0.2, 0.25); a hedged sketch of this metric appears below the table. We use the same preprocessing of text and boxes as mentioned in [52]. We evaluate this by using captions of increasing lengths on the val split of the VQAv2 dataset. |
| Hardware Specification | No | The paper states in the NeurIPS checklist that hardware has been mentioned, but specific details such as GPU/CPU models or processor types are not found in the main text of the paper. |
| Software Dependencies | No | The paper mentions using the 'JAX framework [7]' but does not provide specific version numbers for JAX or other software dependencies. |
| Experiment Setup | Yes | We train the entire model for about 400K steps using the AdamW optimizer with a cosine learning rate schedule. The maximum learning rate is 1.6 × 10⁻⁴ with 10K warm-up steps. We use a weight decay of 0.05. We train with a batch size of 4096 and image resolution of 224 × 224. We use a maximum text sequence length of 32 tokens. For each image in the batch, we sample a maximum of 8 bounding boxes. (An optax-based sketch of this setup appears below the table.) |
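
The dense-captioning metric quoted in the "Dataset Splits" row (a mean of average precisions over paired IoU and METEOR thresholds, following [26]) can be made concrete with a short sketch. This is a simplified, hedged reconstruction rather than the paper's evaluation code: the helper callables `compute_iou` and `meteor_score`, the prediction/ground-truth data layout, and the trapezoidal integration of the precision-recall curve are all assumptions.

```python
# Hedged sketch of the dense-captioning mAP: mean of APs over all pairwise
# combinations of IoU and METEOR thresholds. Helper names and data layout
# are illustrative assumptions, not FlexCap code.
import numpy as np

IOU_THRESHOLDS = (0.3, 0.4, 0.5, 0.6, 0.7)
METEOR_THRESHOLDS = (0.0, 0.05, 0.1, 0.15, 0.2, 0.25)

def average_precision(preds, gts, iou_thr, meteor_thr, compute_iou, meteor_score):
    """AP for one (IoU, METEOR) threshold pair.

    preds: list of dicts {"image_id", "score", "box", "caption"} over the eval set.
    gts:   dict image_id -> list of (gt_box, gt_caption).
    """
    preds = sorted(preds, key=lambda p: -p["score"])            # high to low confidence
    matched = {img: [False] * len(boxes) for img, boxes in gts.items()}
    tps, fps = [], []
    total_gt = sum(len(v) for v in gts.values())
    for p in preds:
        best_j, best_iou = -1, iou_thr
        for j, (gt_box, gt_cap) in enumerate(gts.get(p["image_id"], [])):
            if matched[p["image_id"]][j]:
                continue
            iou = compute_iou(p["box"], gt_box)
            # A prediction matches only if both localization and caption quality
            # clear their respective thresholds.
            if iou >= best_iou and meteor_score(p["caption"], gt_cap) >= meteor_thr:
                best_j, best_iou = j, iou
        if best_j >= 0:
            matched[p["image_id"]][best_j] = True
            tps.append(1); fps.append(0)
        else:
            tps.append(0); fps.append(1)
    if not preds:
        return 0.0
    tp, fp = np.cumsum(tps), np.cumsum(fps)
    recall = tp / max(total_gt, 1)
    precision = tp / np.maximum(tp + fp, 1)
    # Kept simple: trapezoidal area under the precision-recall curve.
    return float(np.trapz(precision, recall))

def dense_captioning_map(preds, gts, compute_iou, meteor_score):
    """Mean of APs over all (IoU threshold, METEOR threshold) pairs."""
    aps = [average_precision(preds, gts, i, m, compute_iou, meteor_score)
           for i in IOU_THRESHOLDS for m in METEOR_THRESHOLDS]
    return sum(aps) / len(aps)
```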
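
The "Experiment Setup" row quotes the optimizer hyperparameters. Since the paper reports training with the JAX framework, a minimal optax sketch of that configuration is given below; the schedule's initial and final values, and the packaging of the remaining quoted settings into a config dict, are assumptions rather than details taken from the paper.

```python
# Hedged sketch of the reported optimizer setup using optax.
import optax

TOTAL_STEPS = 400_000        # ~400K training steps
WARMUP_STEPS = 10_000        # 10K warm-up steps
PEAK_LR = 1.6e-4             # maximum learning rate
WEIGHT_DECAY = 0.05

# Cosine learning-rate schedule with linear warm-up to the peak value.
schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,          # assumed starting LR (not stated in the paper)
    peak_value=PEAK_LR,
    warmup_steps=WARMUP_STEPS,
    decay_steps=TOTAL_STEPS,
    end_value=0.0,           # assumed final LR (not stated in the paper)
)

optimizer = optax.adamw(learning_rate=schedule, weight_decay=WEIGHT_DECAY)

# Remaining quoted settings, collected for reference:
config = dict(
    batch_size=4096,
    image_resolution=(224, 224),
    max_text_tokens=32,
    max_boxes_per_image=8,
)
```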