OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning

Authors: Sheng Liu, Kevin Lin, Lijuan Wang, Junsong Yuan, Zicheng Liu

AAAI 2022, pp. 1773-1781 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments on the two datasets, we demonstrate ViSA's ability to search for visual instances in images not available during training, given a wide range of textual queries including those composed of uncommon words. Experimental results show that ViSA achieves an mAP@50 of 27.8% on OVIS40 and a recall@30 of 21.3% on the OVIS1400 dataset under the most challenging settings.
Researcher Affiliation | Collaboration | Sheng Liu1, Kevin Lin2, Lijuan Wang2, Junsong Yuan1, Zicheng Liu2 (1University at Buffalo, 2Microsoft); {sliu66, jsyuan}@buffalo.edu, {keli, lijuanw, zliu}@microsoft.com
Pseudocode | No | The paper includes diagrams to illustrate the model and training process, but it does not contain any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | We use three image captioning datasets, i.e., Conceptual Captions (CC) (Sharma et al. 2018), SBU Captions (Ordonez, Kulkarni, and Berg 2011), and COCO Captions (Lin et al. 2014), to train our model (for MTP). [...] We also use 98K images with a set of 1,600 categories of visual instance label annotations from Visual Genome (Krishna et al. 2017) to train our model (for ILP).
Dataset Splits | No | The paper mentions using Conceptual Captions, SBU Captions, COCO Captions, and Visual Genome for training, and OVIS40/OVIS1400 for evaluation, but it does not specify explicit train/validation/test splits for these datasets or reference standard splits for reproducibility.
Hardware Specification | No | The paper does not provide any specific hardware details (e.g., GPU models, CPU types, or cloud computing instances) used for running the experiments.
Software Dependencies | No | The paper mentions BERT-Base, the AdamW optimizer, and Faster R-CNN, but does not provide version numbers for any software dependencies or libraries.
Experiment Setup | Yes | We train our ViSA model for 30 epochs with a batch size of 512 using the AdamW optimizer (Loshchilov and Hutter 2019). The learning rate is set to 0.00001.
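
The only training details the paper states are those quoted above (AdamW, learning rate 0.00001, batch size 512, 30 epochs). The sketch below wires those hyperparameters into a PyTorch training loop; since no code is released, the model, loss, and data here are placeholders (PlaceholderViSA, the dummy TensorDataset, and the toy alignment loss are all assumptions, not the authors' implementation).

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model: the real ViSA architecture (BERT-Base text encoder plus
# Faster R-CNN region features) is not released, so a tiny placeholder
# module is used purely to show the optimizer/schedule wiring.
class PlaceholderViSA(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.visual_proj = nn.Linear(2048, dim)  # region-feature size of 2048 is assumed
        self.text_proj = nn.Linear(768, dim)     # BERT-Base hidden size is 768

    def forward(self, region_feats, text_feats):
        v = self.visual_proj(region_feats)
        t = self.text_proj(text_feats)
        # Toy alignment loss: pull paired visual/text embeddings together.
        return (1 - nn.functional.cosine_similarity(v, t)).mean()

# Dummy tensors standing in for image-caption training pairs.
dataset = TensorDataset(torch.randn(2048, 2048), torch.randn(2048, 768))
loader = DataLoader(dataset, batch_size=512, shuffle=True)  # batch size 512 (from the paper)

model = PlaceholderViSA()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # AdamW, lr = 0.00001 (from the paper)

for epoch in range(30):  # 30 epochs (from the paper)
    for region_feats, text_feats in loader:
        loss = model(region_feats, text_feats)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()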