Localized Symbolic Knowledge Distillation for Visual Commonsense Models

Authors: Jae Sung Park, Jack Hessel, Khyathi Chandu, Paul Pu Liang, Ximing Lu, Peter West, Youngjae Yu, Qiuyuan Huang, Jianfeng Gao, Ali Farhadi, Yejin Choi

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The paper states: 'Empirical results and human evaluations in a zero-shot setup demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression to an LLM.' Supporting evidence: Table 3 (zero-shot results on the localized and non-localized visual reasoning tasks); Figure 4 (effect of data quality controlled by filtering threshold on different datasets); Table 5 (human evaluation of generative models with LSKD vs. ChatGPT with verbalizers).
Researcher Affiliation | Collaboration | University of Washington; Allen Institute for Artificial Intelligence; Microsoft Research; Carnegie Mellon University; Yonsei University
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: 'Code will be released in https://github.com/jamespark3922/lskd'. This is a promise of future release, not concrete access at the time of publication.
Open Datasets | Yes | We use 250K images in the union of Visual Genome [26] and VCR [66], which include a diverse set of social situations involving people and objects, as the seed images to collect the knowledge corpus. Following [69], we use the CLIP-ViT-L model in a zero-shot fashion to extract basic concept information about the image using a template. We retrieve places from Places365 [71], objects from Tencent ML-Images [59], and concepts from Open Images [27] to arrive at global concepts (see the concept-retrieval sketch after the table). We fine-tuned OFA-Huge [54] on the Localized Narratives [44] corpus, which pairs 849K images with multi-sentence descriptions. We trained on datasets that provide descriptions of regions within images: a combination of RefCOCO/RefCOCO+/RefCOCOg [64, 37], Sherlock Clues-only [19] (277K), and Visual Genome [26] (1.96M).
Dataset Splits | No | The paper mentions specific splits for the critic model: 'We allocate a subset of 20K statements to train the critic model, and 4k for evaluation.' However, for the main models trained on benchmark datasets (e.g., VCR, VisualCOMET), the paper does not provide specific percentages or sample counts for train/validation splits, relying on the reader's familiarity with the benchmarks.
Hardware Specification | Yes | All models are trained with a learning rate of 1e-5, Adam optimizer [23], linear warmup with cosine annealing, and image size of 480, using 4 × 80GB A100 GPUs.
Software Dependencies | No | The paper mentions using specific models and APIs such as the 'OpenAI Chat API with gpt-3.5-turbo engine', 'BLIP-2', 'FLAN-T5-XXL', and 'Vicuna-13b-v0'. However, it does not provide version numbers for general software dependencies or libraries (e.g., Python, PyTorch, TensorFlow, or specific library versions).
Experiment Setup | Yes | The BLIP-2 critic model is trained with a total batch size of 256, a learning rate of 1e-5, and a maximum of 10 epochs. The discriminative BLIP-2 is trained with a 256 batch size and 128 max sequence length for 1e4 iterations. The BLIP-2 FLAN-T5-XXL and MiniGPT-4 models are trained with a 64 batch size and 2e4 iterations. All models are trained with a learning rate of 1e-5, Adam optimizer [23], linear warmup with cosine annealing, and image size of 480 (see the training-schedule sketch after the table).
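
The Open Datasets row describes a zero-shot concept-retrieval step: CLIP ranks fixed label vocabularies (Places365, Tencent ML-Images, Open Images) against each seed image via a text template. Below is a minimal sketch of that step, assuming the open-source clip package, a ViT-L/14 checkpoint, and an illustrative prompt template, vocabulary file, and top_k; these specifics are assumptions, not details taken from the paper.

```python
# Minimal sketch of zero-shot concept retrieval with CLIP, as summarized in the
# Open Datasets row. The prompt template, vocabulary file name, and top_k are
# illustrative assumptions, not values reported in the paper.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def retrieve_concepts(image_path, vocabulary, template="a photo of {}", top_k=5):
    """Rank a concept vocabulary (e.g., Places365 labels) against one image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    prompts = clip.tokenize([template.format(c) for c in vocabulary]).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(prompts)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        sims = (image_feat @ text_feat.T).squeeze(0)  # cosine similarity per concept
    scores, idx = sims.topk(top_k)
    return [(vocabulary[i], s.item()) for i, s in zip(idx.tolist(), scores)]

# Usage (hypothetical files): the same call would be repeated over the Places365,
# Tencent ML-Images, and Open Images vocabularies to build the global concepts.
# places = [line.strip() for line in open("places365_labels.txt")]
# print(retrieve_concepts("seed_image.jpg", places))
```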
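
The Experiment Setup row quotes a standard optimization recipe: Adam, learning rate 1e-5, linear warmup followed by cosine annealing, trained for a fixed number of iterations. A minimal PyTorch sketch of that schedule is shown below; the warmup length and the stand-in module are assumptions, while the 2e4 total steps mirror the quoted setting for the generative models.

```python
# Minimal sketch of the quoted schedule: Adam at lr 1e-5 with linear warmup
# followed by cosine annealing. warmup_steps and the stand-in model are assumed;
# total_steps follows the quoted 2e4 iterations for the generative models.
import math
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(768, 768)   # placeholder for the BLIP-2 / MiniGPT-4 weights
total_steps = 20_000                # "2e4 iterations"
warmup_steps = 1_000                # assumed; the paper does not state the warmup length

optimizer = Adam(model.parameters(), lr=1e-5)

def warmup_then_cosine(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                            # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))                 # cosine annealing

scheduler = LambdaLR(optimizer, lr_lambda=warmup_then_cosine)

# Inside the training loop, call optimizer.step() then scheduler.step() once per iteration.
```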