Mass-Producing Failures of Multimodal Systems with Language Models
Authors: Shengbang Tong, Erik Jones, Jacob Steinhardt
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate MULTIMON, we measure the quantity and quality of the systematic failures. We find that MULTIMON uncovers 14 systematic failures of the CLIP text-encoder, and from them over one thousand new individual failures (Section 4). The failures of the CLIP text-encoder transfer to downstream text-to-image, text-to-video, and text-to-3d systems (Figure 1, Section 5). Through a manual evaluation, we find that the systems err on 80.0% of the pairs generated by MULTIMON, compared to only 20.5% for a baseline system. |
| Researcher Affiliation | Academia | Shengbang Tong Erik Jones Jacob Steinhardt UC Berkeley {tsb, erjones, jsteinhardt}@berkeley.edu |
| Pseudocode | Yes | We provide pseudocode for MULTIMON in Algorithm 1. The algorithm also contains steps to steer scraping discussed in Section 4.2. |
| Open Source Code | Yes | Code for MULTIMON is available at https://github.com/tsb0601/MultiMon |
| Open Datasets | Yes | For the input corpus we test both SNLI [Bowman et al., 2015] and MS-COCO Captions [Lin et al., 2014]. |
| Dataset Splits | No | The paper describes generating and evaluating new instances for testing MULTIMON's effectiveness, and using GPT-4 for categorization and generation. However, it does not specify traditional train/validation/test splits for the training of a model within the described methodology. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments or the MULTIMON pipeline itself. It mentions using large language models (GPT-4, Claude, GPT-3.5) and evaluating on various multimodal systems, but not the underlying hardware for these operations. |
| Software Dependencies | Yes | For the language model categorizer, we consider GPT-4 [OpenAI, 2023b], Claude v1.3 [Anthropic, 2023], and GPT-3.5 [Brockman et al., 2023], and use GPT-4 as a generator unless otherwise noted. |
| Experiment Setup | Yes | During the scraping stage, we return the n = 150 pairs with highest CLIP similarity, and use a semantic similarity threshold of τ = 0.7. We choose n such that this prompt fits in the model's context window. We set m to be the maximum number of examples the generator can empirically produce in a single response. To compute success rate, we use GPT-4 to generate k = 82 new instances and set the CLIP similarity threshold for success to be t = 0.88. |
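The scraping-stage parameters in the last row (n = 150 highest-CLIP-similarity pairs, semantic similarity threshold τ = 0.7) can be sketched as a simple filter: keep sentence pairs that the CLIP text encoder embeds as similar even though their actual meanings diverge. Below is a minimal, hedged sketch of that logic, not the paper's implementation; `clip_embed` and `semantic_embed` are hypothetical stand-ins for the CLIP text encoder and a reference semantic-similarity model.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def scrape_failure_pairs(pairs, clip_embed, semantic_embed, n=150, tau=0.7):
    """Return up to n candidate failure pairs: sentences with low semantic
    similarity (below tau) ranked by CLIP text-encoder similarity, highest first.

    pairs          -- iterable of (sentence1, sentence2) strings
    clip_embed     -- callable mapping a sentence to a CLIP text embedding
    semantic_embed -- callable mapping a sentence to a semantic embedding
    """
    candidates = []
    for s1, s2 in pairs:
        # Discard pairs whose meanings are actually close.
        if cosine(semantic_embed(s1), semantic_embed(s2)) >= tau:
            continue
        # Semantically distinct but possibly CLIP-confusable: score by CLIP.
        candidates.append((cosine(clip_embed(s1), clip_embed(s2)), s1, s2))
    candidates.sort(key=lambda c: c[0], reverse=True)
    return [(s1, s2) for _, s1, s2 in candidates[:n]]
```

With real models plugged in, the returned pairs would then be passed to the language-model categorizer described in the table above; the toy filter here only illustrates the two-threshold structure of the scraping step.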