Localized Symbolic Knowledge Distillation for Visual Commonsense Models

Authors: Jae Sung Park, Jack Hessel, Khyathi Chandu, Paul Pu Liang, Ximing Lu, Peter West, Youngjae Yu, Qiuyuan Huang, Jianfeng Gao, Ali Farhadi, Yejin Choi

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The paper states: 'Empirical results and human evaluations in a zero-shot setup demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression to an LLM.' Supporting evidence: Table 3 (zero-shot results on the localized and non-localized visual reasoning tasks); Figure 4 (effect of data quality controlled by filtering threshold on different datasets); Table 5 (human evaluation of generative models with LSKD vs. ChatGPT with verbalizers).
Researcher Affiliation | Collaboration | University of Washington; Allen Institute for Artificial Intelligence; Microsoft Research; Carnegie Mellon University; Yonsei University
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: 'Code will be released in https://github.com/jamespark3922/lskd'. This is a promise of future release, not concrete access at the time of publication.
Open Datasets | Yes | We use 250K images in the union of Visual Genome [26] and VCR [66], which include a diverse set of social situations involving people and objects, as the seed images to collect the knowledge corpus. Following [69], we use the CLIP-ViT-L model in a zero-shot fashion to extract basic concept information about the image using a template. We retrieve places from Places365 [71], objects from Tencent ML-Images [59], and concepts from Open Images [27] to arrive at global concepts (see the concept-retrieval sketch after the table). We fine-tuned OFA-Huge [54] on the Localized Narratives [44] corpus, which pairs 849K images with multi-sentence descriptions. We trained on datasets that provide descriptions of regions within images: a combination of RefCOCO/RefCOCO+/RefCOCOg [64, 37], Sherlock Clues-only [19] (277K), and Visual Genome [26] (1.96M).
Dataset Splits | No | The paper mentions specific splits for the critic model: 'We allocate a subset of 20K statements to train the critic model, and 4k for evaluation.' However, for the main models trained on benchmark datasets (e.g., VCR, VisualCOMET), the paper does not provide specific percentages or sample counts for train/validation splits, relying on the reader's familiarity with the benchmarks.
Hardware Specification | Yes | All models are trained with a learning rate of 1e-5, Adam optimizer [23], linear warmup with cosine annealing, and image size of 480, using 4 × 80GB A100 GPUs.
Software Dependencies | No | The paper mentions using specific models and APIs such as the 'OpenAI Chat API with gpt-3.5-turbo engine', 'BLIP-2', 'FLAN-T5-XXL', and 'Vicuna-13b-v0'. However, it does not provide version numbers for general software dependencies or libraries (e.g., Python, PyTorch, TensorFlow, or specific library versions).
Experiment Setup | Yes | The BLIP-2 critic model is trained with a total batch size of 256, a learning rate of 1e-5, and a maximum of 10 epochs. The discriminative BLIP-2 is trained with a 256 batch size and 128 max sequence length for 1e4 iterations. The BLIP-2 FLAN-T5-XXL and MiniGPT-4 models are trained with a 64 batch size and 2e4 iterations. All models are trained with a learning rate of 1e-5, Adam optimizer [23], linear warmup with cosine annealing, and image size of 480 (see the training-schedule sketch after the table).
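
The Open Datasets row describes a zero-shot concept-retrieval step: CLIP ranks fixed label vocabularies (Places365, Tencent ML-Images, Open Images) against each seed image via a text template. Below is a minimal sketch of that step, assuming the open-source clip package, a ViT-L/14 checkpoint, and an illustrative prompt template, vocabulary file, and top_k; these specifics are assumptions, not details taken from the paper.

```python
# Minimal sketch of zero-shot concept retrieval with CLIP, as summarized in the
# Open Datasets row. The prompt template, vocabulary file name, and top_k are
# illustrative assumptions, not values reported in the paper.
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def retrieve_concepts(image_path, vocabulary, template="a photo of {}", top_k=5):
    """Rank a concept vocabulary (e.g., Places365 labels) against one image."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    prompts = clip.tokenize([template.format(c) for c in vocabulary]).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image)
        text_feat = model.encode_text(prompts)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        sims = (image_feat @ text_feat.T).squeeze(0)  # cosine similarity per concept
    scores, idx = sims.topk(top_k)
    return [(vocabulary[i], s.item()) for i, s in zip(idx.tolist(), scores)]

# Usage (hypothetical files): the same call would be repeated over the Places365,
# Tencent ML-Images, and Open Images vocabularies to build the global concepts.
# places = [line.strip() for line in open("places365_labels.txt")]
# print(retrieve_concepts("seed_image.jpg", places))
```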
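
The Experiment Setup row quotes a standard optimization recipe: Adam, learning rate 1e-5, linear warmup followed by cosine annealing, trained for a fixed number of iterations. A minimal PyTorch sketch of that schedule is shown below; the warmup length and the stand-in module are assumptions, while the 2e4 total steps mirror the quoted setting for the generative models.

```python
# Minimal sketch of the quoted schedule: Adam at lr 1e-5 with linear warmup
# followed by cosine annealing. warmup_steps and the stand-in model are assumed;
# total_steps follows the quoted 2e4 iterations for the generative models.
import math
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(768, 768)   # placeholder for the BLIP-2 / MiniGPT-4 weights
total_steps = 20_000                # "2e4 iterations"
warmup_steps = 1_000                # assumed; the paper does not state the warmup length

optimizer = Adam(model.parameters(), lr=1e-5)

def warmup_then_cosine(step):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                            # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))                 # cosine annealing

scheduler = LambdaLR(optimizer, lr_lambda=warmup_then_cosine)

# Inside the training loop, call optimizer.step() then scheduler.step() once per iteration.
```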