What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs

Authors: Tal Shaharabany, Yoad Tewel, Lior Wolf

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present our results for the three tasks: (i) weakly supervised object localization (WSOL), (ii) weakly supervised phrase grounding (WSG), with training on either MSCOCO 2014 [43] or the Visual Genome (VG) dataset [40], and (iii) the new task we present (WWbL). For the first task, we employ three fine-grained localization datasets, and for the other two, we use the three datasets commonly used in WSG.
Researcher Affiliation | Academia | Tal Shaharabany, Yoad Tewel, Lior Wolf (Tel-Aviv University), {shaharabany,yoadtewel,wolf}@mail.tau.ac.il
Pseudocode | Yes | Algorithm 1: WWbL inference method
Open Source Code | Yes | Our code is available at https://github.com/talshaharabany/what-is-where-by-looking and a live demo can be found at https://replicate.com/talshaharabany/what-is-where-by-looking.
Open Datasets | Yes | CUB-200-2011 [76] contains 200 bird species, with 11,788 images divided into 5,994 training images and 5,794 test images. Stanford Cars [39] contains 196 categories of cars, with 8,144 samples in the training set and 8,041 samples in the test set. Stanford Dogs [36] consists of 20,580 images across 120 classes of dogs, with a split of 12,000 for training and 8,580 for testing. MSCOCO 2014 [43], using the splits of Akbari et al. [2], consists of 82,783 training images and 40,504 validation images. VG [40] contains 77,398 training, 5,000 validation, and 5,000 test images.
Dataset Splits | Yes | CUB-200-2011 [76] contains 200 bird species, with 11,788 images divided into 5,994 training images and 5,794 test images. Stanford Cars [39] contains 196 categories of cars, with 8,144 samples in the training set and 8,041 samples in the test set. Stanford Dogs [36] consists of 20,580 images across 120 classes of dogs, with a split of 12,000 for training and 8,580 for testing. MSCOCO 2014 [43], using the splits of Akbari et al. [2], consists of 82,783 training images and 40,504 validation images. VG [40] contains 77,398 training, 5,000 validation, and 5,000 test images.
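Where the quotes state both per-split counts and an overall total, the numbers can be cross-checked directly. A minimal sketch (not from the paper; the dictionary layout is illustrative, and only datasets whose totals are quoted above are included):

```python
# Verify that the quoted train/test counts sum to the quoted dataset totals.
# Counts are copied verbatim from the dataset descriptions above; datasets
# without a quoted overall total (Stanford Cars, MSCOCO 2014, VG) are omitted.

splits = {
    "CUB-200-2011": {"train": 5994, "test": 5794, "total": 11788},
    "Stanford Dogs": {"train": 12000, "test": 8580, "total": 20580},
}

for name, s in splits.items():
    assert s["train"] + s["test"] == s["total"], f"{name}: split counts disagree"
    print(f"{name}: {s['train']} + {s['test']} = {s['total']}")
```

Both quoted totals check out, so the split description is internally consistent.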
Hardware Specification | Yes | All models are trained on a single GeForce RTX 2080Ti Nvidia GPU. All models are trained on a double 2080Ti Nvidia GPU.
Software Dependencies | No | The paper mentions optimizers (SGD) and model architectures (VGG16, MobileNetV1, ResNet50, InceptionV3, DenseNet161), but does not provide specific version numbers for any software libraries, programming languages, or environments used (e.g., Python, PyTorch, TensorFlow, CUDA versions).
Experiment Setup | Yes | An SGD optimizer with a batch size of 48 and an initial learning rate of 0.0003 is used for 100 epochs, with optimizer momentum of 0.9 and weight decay of 0.0001. During training, a random horizontal flip with 0.5 probability is applied. The combined loss is defined as L = λ1 Lfore(I, t) + λ2 Lback(I, t) + λ3 Lrmap(I, H) + λ4 Lreg(I), where λ1, ..., λ4 are fixed weighting parameters for all datasets, determined after a limited hyperparameter search on the CUB [76] validation set to be 1, 1, 4, 1 respectively.
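The quoted setup pins down the loss weighting and the optimizer hyperparameters exactly, so both can be sketched concretely. A minimal sketch, not the authors' code: the function names and the loss-term placeholders are illustrative, and the SGD step uses the standard momentum-plus-L2-weight-decay formulation with the quoted values (lr 0.0003, momentum 0.9, weight decay 0.0001):

```python
# (λ1, λ2, λ3, λ4) = (1, 1, 4, 1), from the limited search on the CUB validation set.
LAMBDAS = (1.0, 1.0, 4.0, 1.0)

def combined_loss(l_fore, l_back, l_rmap, l_reg, lambdas=LAMBDAS):
    """L = λ1·Lfore + λ2·Lback + λ3·Lrmap + λ4·Lreg (terms are precomputed scalars here)."""
    l1, l2, l3, l4 = lambdas
    return l1 * l_fore + l2 * l_back + l3 * l_rmap + l4 * l_reg

def sgd_step(w, grad, velocity, lr=3e-4, momentum=0.9, weight_decay=1e-4):
    """One scalar SGD update with momentum and L2 weight decay (PyTorch-style)."""
    g = grad + weight_decay * w        # weight decay folds into the gradient
    velocity = momentum * velocity + g # momentum buffer accumulates gradients
    return w - lr * velocity, velocity
```

With all four loss terms equal to 1.0, `combined_loss` returns 1 + 1 + 4 + 1 = 7.0, which makes the 4x emphasis on the relevancy-map term Lrmap easy to see.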