What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs
Authors: Tal Shaharabany, Yoad Tewel, Lior Wolf
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present our results for the three tasks: (i) weakly supervised object localization (WSOL), (ii) weakly supervised phrase grounding (WSG), with training on either MSCOCO 2014 [43] or the Visual Genome (VG) dataset [40], and (iii) the new task we present (WWbL). For the first task, we employ three fine-grained localization datasets, and for the other two, we use the three datasets commonly used in WSG. |
| Researcher Affiliation | Academia | Tal Shaharabany Yoad Tewel Lior Wolf Tel-Aviv University {shaharabany,yoadtewel,wolf}@mail.tau.ac.il |
| Pseudocode | Yes | Algorithm 1: WWbL inference method |
| Open Source Code | Yes | Our code is available at https://github.com/talshaharabany/what-is-where-by-looking and a live demo can be found at https://replicate.com/talshaharabany/what-is-where-by-looking. |
| Open Datasets | Yes | CUB-200-2011 [76] contains 200 bird species, with 11,788 images divided into 5994 training images and 5794 test images. Stanford Cars [39] contains 196 categories of cars, with 8144 samples in the training set and 8041 samples in the test set. Stanford Dogs [36] consists of 20,580 images, with a split of 12,000 for training and 8580 for testing, where the data has 120 classes of dogs. MSCOCO 2014 [43], using the splits of Akbari et al. [2], consists of 82,783 training images and 40,504 validation images. VG [40] contains 77,398 training, 5000 validation, and 5000 test images. |
| Dataset Splits | Yes | CUB-200-2011 [76] contains 200 bird species, with 11,788 images divided into 5994 training images and 5794 test images. Stanford Cars [39] contains 196 categories of cars, with 8144 samples in the training set and 8041 samples in the test set. Stanford Dogs [36] consists of 20,580 images, with a split of 12,000 for training and 8580 for testing, where the data has 120 classes of dogs. MSCOCO 2014 [43], using the splits of Akbari et al. [2], consists of 82,783 training images and 40,504 validation images. VG [40] contains 77,398 training, 5000 validation, and 5000 test images. |
| Hardware Specification | Yes | All models are trained on a single GeForce RTX 2080Ti Nvidia GPU. All models are trained on a double 2080Ti Nvidia GPU. |
| Software Dependencies | No | The paper mentions optimizers (SGD) and model architectures (VGG16, MobileNet V1, ResNet50, Inception V3, DenseNet161), but does not provide specific version numbers for any software libraries, programming languages, or environments used (e.g., Python, PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | An SGD optimizer with a batch size of 48 and an initial learning rate of 0.0003 for 100 epochs is used. The optimizer momentum of 0.9 and weight decay of 0.0001 are also used. During the training, a random horizontal flip with 0.5 probability is applied. The combined loss is defined as L = λ1 Lfore(I, t) + λ2 Lback(I, t) + λ3 Lrmap(I, H) + λ4 Lreg(I), where λ1, ..., λ4 are fixed weighting parameters for all datasets, which were determined after a limited hyperparameters search on the CUB [76] validation set to be 1, 1, 4, 1 respectively. |
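The experiment-setup row above can be sketched in code. This is a minimal illustration only: the individual loss terms (`l_fore`, `l_back`, `l_rmap`, `l_reg`) are placeholder scalars standing in for the paper's actual loss functions, and only the weighting scheme and optimizer hyperparameters come from the quoted text.

```python
# Hedged sketch of the paper's combined loss:
#   L = λ1·Lfore(I,t) + λ2·Lback(I,t) + λ3·Lrmap(I,H) + λ4·Lreg(I)
# with weights (λ1, ..., λ4) = (1, 1, 4, 1), chosen via a limited
# hyperparameter search on the CUB validation set.

LAMBDAS = (1.0, 1.0, 4.0, 1.0)

def combined_loss(l_fore, l_back, l_rmap, l_reg, lambdas=LAMBDAS):
    """Weighted sum of the four loss terms (placeholder scalars here)."""
    l1, l2, l3, l4 = lambdas
    return l1 * l_fore + l2 * l_back + l3 * l_rmap + l4 * l_reg

# Optimizer and training hyperparameters quoted from the paper,
# collected as a plain config dict (framework-agnostic):
sgd_config = {
    "optimizer": "SGD",
    "lr": 3e-4,            # initial learning rate 0.0003
    "momentum": 0.9,
    "weight_decay": 1e-4,
    "batch_size": 48,
    "epochs": 100,
    "hflip_prob": 0.5,     # random horizontal flip during training
}
```

Note how the `Lrmap` term is weighted 4x more heavily than the other three terms, per the reported hyperparameter search.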