Mass-Producing Failures of Multimodal Systems with Language Models

Authors: Shengbang Tong, Erik Jones, Jacob Steinhardt

NeurIPS 2023

Reproducibility assessment. Each variable below lists the result, followed by the supporting excerpt or LLM response:
Research Type: Experimental
  "To evaluate MULTIMON, we measure the quantity and quality of the systematic failures. We find that MULTIMON uncovers 14 systematic failures of the CLIP text-encoder, and from them over one thousand new individual failures (Section 4). The failures of the CLIP text-encoder transfer to downstream text-to-image, text-to-video, and text-to-3D systems (Figure 1, Section 5). Through a manual evaluation, we find that the systems err on 80.0% of the pairs generated by MULTIMON, compared to only 20.5% for a baseline system."
Researcher Affiliation: Academia
  "Shengbang Tong, Erik Jones, Jacob Steinhardt. UC Berkeley. {tsb, erjones, jsteinhardt}@berkeley.edu"
Pseudocode: Yes
  "We provide pseudocode for MULTIMON in Algorithm 1. The algorithm also contains steps to steer scraping discussed in Section 4.2."
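
To make the row concrete, here is a minimal structural sketch of how Algorithm 1's three stages compose. The helper names (`scrape`, `llm_categorize`, `llm_generate`) are hypothetical, not from the paper's released code; plausible implementations are sketched under the Software Dependencies and Experiment Setup rows below.

```python
# A structural sketch of Algorithm 1's three stages; `scrape`,
# `llm_categorize`, and `llm_generate` are hypothetical helpers
# (sketched under the rows below), not names from the paper's code.

def multimon(sentences, n=150, tau=0.7):
    # Stage 1 (scrape): collect sentence pairs the CLIP text encoder
    # embeds similarly even though a semantic encoder says they differ.
    failures = scrape(sentences, n=n, tau=tau)

    # Stage 2 (categorize): have a language model group the individual
    # failures into systematic failure modes.
    modes = llm_categorize([(a, b) for _, a, b in failures])

    # Stage 3 (generate): have the language model produce new instances
    # of each systematic failure mode.
    return {mode: llm_generate(mode) for mode in modes}
```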
Open Source Code: Yes
  "Code for MULTIMON is available at https://github.com/tsb0601/MultiMon"
Open Datasets: Yes
  "For the input corpus we test both SNLI [Bowman et al., 2015] and MS-COCO Captions [Lin et al., 2014]."
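
One plausible way to assemble the scraping corpus, assuming the Hugging Face `datasets` copy of SNLI (the paper does not prescribe a loader; an MS-COCO caption corpus would be pooled analogously from its annotation files):

```python
from datasets import load_dataset

# SNLI ships premise/hypothesis sentence pairs; one plausible
# preparation pools both fields into a single sentence corpus for
# MULTIMON's scraping stage to draw pairs from.
snli = load_dataset("snli", split="train")
sentences = sorted({ex["premise"] for ex in snli} |
                   {ex["hypothesis"] for ex in snli})
```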
Dataset Splits: No
  The paper describes generating and evaluating new instances to test MULTIMON's effectiveness, and uses GPT-4 for categorization and generation, but it specifies no traditional train/validation/test splits, since the methodology trains no model.
Hardware Specification: No
  The paper does not report the hardware (e.g., GPU/CPU models, memory) used to run its experiments or the MULTIMON pipeline. It mentions using large language models (GPT-4, Claude, GPT-3.5) and evaluating several multimodal systems, but not the hardware underlying these operations.
Software Dependencies: Yes
  "For the language model categorizer, we consider GPT-4 [OpenAI, 2023b], Claude v1.3 [Anthropic, 2023], and GPT-3.5 [Brockman et al., 2023], and use GPT-4 as a generator unless otherwise noted."
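
A sketch of how the categorizer and generator calls might look with the OpenAI Python client. The prompts paraphrase the paper's described intent rather than reproduce its exact prompts, and the function names are illustrative:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_categorize(pairs, model="gpt-4"):
    """Ask the model to name systematic failure modes behind the pairs."""
    listing = "\n".join(f"- {a} / {b}" for a, b in pairs)
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content":
                   "CLIP's text encoder embeds each pair below similarly "
                   "even though the two sentences mean different things. "
                   "List the systematic failure modes you see, one per "
                   "line:\n" + listing}],
    )
    # Parse one failure mode per non-empty line of the reply.
    return [line.lstrip("- ").strip()
            for line in resp.choices[0].message.content.splitlines()
            if line.strip()]

def llm_generate(mode, m=20, model="gpt-4"):
    """Ask the model for up to m new sentence pairs exhibiting one mode."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content":
                   f"Write {m} new sentence pairs that exhibit this CLIP "
                   f"text-encoder failure mode: {mode}"}],
    )
    return resp.choices[0].message.content
```

Swapping in Claude v1.3 or GPT-3.5 changes only the `model` argument (or the client), not the pipeline structure.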
Experiment Setup: Yes
  "During the scraping stage, we return the n = 150 pairs with highest CLIP similarity, and use a semantic similarity threshold of τ = 0.7. We choose n such that this prompt fits in the model's context window. We set m to be the maximum number of examples the generator can empirically produce in a single response. To compute success rate, we use GPT-4 to generate k = 82 new instances and set the CLIP similarity threshold for success to be t = 0.88."
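
The thresholds in this row map directly onto a filtering rule. Below is a minimal sketch, assuming sentence-transformers checkpoints as stand-ins for the CLIP text encoder and the semantic-similarity model; the values n = 150, τ = 0.7, and t = 0.88 come from the paper, but the specific encoder checkpoints are assumptions:

```python
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

clip_text = SentenceTransformer("clip-ViT-B-32")    # CLIP text-encoder stand-in
semantic = SentenceTransformer("all-MiniLM-L6-v2")  # semantic-encoder stand-in

def scrape(sentences, n=150, tau=0.7):
    """Return the n pairs with highest CLIP similarity among pairs the
    semantic encoder scores below tau (i.e., their meanings differ)."""
    clip_emb = clip_text.encode(sentences, convert_to_tensor=True)
    sem_emb = semantic.encode(sentences, convert_to_tensor=True)
    scored = []
    for i, j in combinations(range(len(sentences)), 2):
        if util.cos_sim(sem_emb[i], sem_emb[j]).item() < tau:
            scored.append((util.cos_sim(clip_emb[i], clip_emb[j]).item(),
                           sentences[i], sentences[j]))
    scored.sort(reverse=True)  # highest CLIP similarity first
    return scored[:n]

def is_successful_failure(a, b, t=0.88):
    """A generated pair counts as a success if CLIP still embeds the two
    (semantically different) sentences above similarity t."""
    return util.cos_sim(clip_text.encode(a, convert_to_tensor=True),
                        clip_text.encode(b, convert_to_tensor=True)).item() >= t
```

The quadratic pair loop is fine for illustration; at SNLI scale one would batch the cosine similarities on a GPU instead.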