Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
Authors: Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Marcin Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael S. Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, Pete Florence
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments with vision, language, and audio modalities show that on various problems, multimodal prompt-engineered systems can be quantitatively competitive with zero-shot state-of-the-art on standard benchmarks including (i) image captioning on MS COCO, (ii) contextual image captioning and description (improving from 11.3 (Kreiss et al., 2021) to 38.8 captioning CIDEr on Concadia), and (iii) video-to-text retrieval (from 40.3 (Portillo Quintero et al., 2021) to 44.7 zero-shot R@1 on MSR-VTT (Xu et al., 2016)). |
| Researcher Affiliation | Industry | Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael S. Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, Pete Florence Google |
| Pseudocode | No | The paper describes methods and processes through textual descriptions and flow diagrams (e.g., Figure 8), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Open-source code is available at https://socraticmodels.github.io. |
| Open Datasets | Yes | We quantitatively evaluate example systems on: image captioning (Sec. 4.1), contextual image captioning (Sec. 4.2), and video-to-text retrieval (Sec. 4.3). (Referring to MS COCO, Concadia, MSR-VTT) and For place recognition, we use a VLM to rank Places365 (Zhou et al., 2016) scene categories against the image... For object and people recognition, we use a VLM to rank Open Images object categories (Kuznetsova et al., 2020)... and trained on 5-second audio clips from the VGGSound dataset (Chen et al., 2020). Also zero-shot image classification accuracy on ImageNet (Deng et al., 2009). |
| Dataset Splits | No | The paper explicitly states using a 'random sampled subset of 100 images from the test split' for MS COCO and 'the full Concadia test split with 9,691 images', and 'MSR-VTT ... with the original full test set'. While it mentions few-shot prompting using training examples, it does not specify explicit validation dataset splits created or used by the authors for their experiments. |
| Hardware Specification | Yes | In particular, we use CLIP (Radford et al., 2021) as the text-image similarity VLM (ViT-L/14 with 428M params, except on MSR-VTT which uses ViT-B/32)... All pretrained models are used off-the-shelf with no additional finetuning. In terms of compute resources required, all experiments can be run on a single machine using an NVIDIA V100 GPU with internet access for outsourced API calls (e.g., GPT-3 and Google Cloud Speech-to-text). |
| Software Dependencies | No | The paper mentions using specific models and APIs like 'CLIP (Radford et al., 2021)', 'ViLD (Gu et al., 2021)', 'Wav2CLIP (Wu et al., 2021a)', 'Google Cloud Speech-to-text API (gcl)', 'GPT-3 (Brown et al., 2020; Ouyang et al., 2022)', and 'RoBERTa (Liu et al., 2019b)'. It also states code can be run with 'Colab'. However, it does not provide specific version numbers for underlying software dependencies such as Python, PyTorch/TensorFlow, or other libraries. |
| Experiment Setup | Yes | Method. We can generate image captions via multimodal prompt engineering between a VLM and LLM, i.e., via caption = f³_LLM(f²_VLM(f¹_LLM(f_VLM(image)))). First (1), the VLM is used to zero-shot detect different place categories (Places365 (Zhou et al., 2016)), object categories (from Tencent ML-Images (Wu et al., 2019)), image type ({photo, cartoon, sketch, painting}) and the number of people {no people, one person, ..., several people}. The top-k ranked results in each category can then be substituted into an LLM prompt as context, shown in Fig. 3, left. Second (2), given the VLM-informed language prompt, a causal LLM (i.e., for text completion) generates n candidate captions. For this step, we use a non-zero next-token sampling temperature (e.g., 0.9 for GPT-3), to return sufficiently diverse, but reasonable results across the n candidates. A minimal code sketch of this pipeline appears below the table. |
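
The pipeline quoted in the Experiment Setup row can be illustrated with a short sketch. This is a minimal, non-authoritative illustration assuming the open-source CLIP package (github.com/openai/CLIP) and the legacy OpenAI completions API; the prompt template, category lists, engine name, and the final CLIP re-ranking of candidate captions (one plausible reading of the f²_VLM stage in the quoted equation) are assumptions for illustration, not the paper's exact implementation.

```python
# Sketch of a Socratic-Models-style captioning loop: VLM ranks categories,
# LLM completes a VLM-informed prompt, VLM re-ranks the candidate captions.
# Hypothetical prompt wording and category lists; not the paper's exact code.
import clip            # github.com/openai/CLIP
import openai          # legacy Completion API assumed here
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)


def clip_rank(image_path, texts, top_k=3):
    """Zero-shot rank a list of text strings against an image with CLIP."""
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    tokens = clip.tokenize(texts, truncate=True).to(device)
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(tokens)
    img_feat /= img_feat.norm(dim=-1, keepdim=True)
    txt_feat /= txt_feat.norm(dim=-1, keepdim=True)
    sims = (img_feat @ txt_feat.T).squeeze(0)
    top = sims.topk(min(top_k, len(texts))).indices.tolist()
    return [texts[i] for i in top]


def socratic_caption(image_path, place_categories, object_categories, n=10):
    # (1) VLM zero-shot detects likely places and objects for the image.
    places = clip_rank(image_path, place_categories, top_k=3)
    objects = clip_rank(image_path, object_categories, top_k=10)

    # (2) Substitute the top-ranked categories into an LLM prompt and sample
    #     n diverse candidate captions (temperature 0.9, as quoted above).
    prompt = (
        "I am an intelligent image captioning bot.\n"
        f"I think this photo was taken at a {', '.join(places)}.\n"
        f"I think there might be a {', '.join(objects)} in this photo.\n"
        "A creative short caption I can generate to describe this image is:"
    )
    completions = openai.Completion.create(
        engine="text-davinci-002", prompt=prompt,
        max_tokens=64, temperature=0.9, n=n)
    candidates = [c.text.strip() for c in completions.choices]

    # (3) VLM re-ranks the candidates against the image (assumed step).
    return clip_rank(image_path, candidates, top_k=1)[0]
```

As a usage note, `place_categories` and `object_categories` would be loaded from the Places365 and Tencent ML-Images label sets referenced in the quote; only the ranking and prompting mechanics are shown here.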