Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Fool Your (Vision and) Language Model with Embarrassingly Simple Permutations
Authors: Yongshuo Zong, Tingyang Yu, Ruchika Chavhan, Bingchen Zhao, Timothy Hospedales
ICML 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Specifically, we show empirically that popular models are vulnerable to adversarial permutation in answer sets for multiple-choice prompting, which is surprising as models should ideally be as invariant to prompt permutation as humans are. |
| Researcher Affiliation | Academia | ¹University of Edinburgh, ²EPFL. |
| Pseudocode | No | The paper provides a mathematical equation for the attack but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/ys-zong/FoolyourVLLMs. |
| Open Datasets | Yes | All of the datasets we use are publicly available. Specifically, for LLMs, we utilize MMLU (Hendrycks et al., 2020), ARC challenge (ARC-c) (Clark et al., 2018), BoolQ (Clark et al., 2019), SocialiQA (Sap et al., 2019), and MedMCQA (Pal et al., 2022). For VLLMs, we use ScienceQA (Lu et al., 2022), A-OKVQA (Schwenk et al., 2022), MMBench (Liu et al., 2023c), and SEED-Bench (Li et al., 2023a). |
| Dataset Splits | Yes | Specifically, for LLMs, we utilize MMLU (Hendrycks et al., 2020), ARC challenge (ARC-c) (Clark et al., 2018), BoolQ (Clark et al., 2019), SocialiQA (Sap et al., 2019), and MedMCQA (Pal et al., 2022). For VLLMs, we use ScienceQA (Lu et al., 2022), A-OKVQA (Schwenk et al., 2022), MMBench (Liu et al., 2023c), and SEED-Bench (Li et al., 2023a). ... As many of the benchmarks do not provide a training set, we conduct two fine-tuning experiments using Llama2-7B on two datasets that do provide training sets: ARC-Challenge (Clark et al., 2018) and MedMCQA (Pal et al., 2022). |
| Hardware Specification | Yes | Experiments are conducted on A100-80GB GPUs. |
| Software Dependencies | No | The paper mentions accessing model weights from Hugging Face or official repositories but does not explicitly list specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For both LLMs and VLLMs, we use greedy decoding to ensure reproducibility. ... We fine-tune with LoRA (Hu et al., 2022) for 1 epoch. |
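
The table notes that the paper gives the attack as an equation rather than as pseudocode. As a rough illustration only, the idea of an adversarial permutation over a multiple-choice answer set could be sketched as below. This is not the authors' implementation: `permutation_attack`, `always_first`, and `oracle` are hypothetical names, and the `predict` callable stands in for a real model queried with greedy decoding. The sketch assumes the attack counts as successful as soon as any reordering of the options makes the model pick a wrong position.

```python
from itertools import permutations
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical model interface: takes a question and an ordered list of
# options, returns the index (position) of the option it selects.
Predict = Callable[[str, Sequence[str]], int]


def permutation_attack(
    question: str,
    options: Sequence[str],
    answer_idx: int,
    predict: Predict,
) -> Optional[Tuple[int, ...]]:
    """Search all orderings of the options for one that fools `predict`.

    Returns the first permutation (as a tuple of original option indices)
    under which the model's choice differs from the correct position, or
    None if the model is correct under every ordering.
    """
    for perm in permutations(range(len(options))):
        permuted = [options[i] for i in perm]
        # Position of the correct answer after reordering.
        correct_pos = perm.index(answer_idx)
        if predict(question, permuted) != correct_pos:
            return perm
    return None


# Toy stand-ins for a model (assumptions, for illustration only).
def always_first(question: str, options: Sequence[str]) -> int:
    """A position-biased 'model' that always picks the first option."""
    return 0


def oracle(question: str, options: Sequence[str]) -> int:
    """A permutation-invariant 'model' that always finds the right text."""
    return list(options).index("Paris")


if __name__ == "__main__":
    opts = ["Paris", "London", "Berlin"]
    print(permutation_attack("Capital of France?", opts, 0, always_first))
    print(permutation_attack("Capital of France?", opts, 0, oracle))
```

The exhaustive search is factorial in the number of options, which stays small for typical multiple-choice benchmarks with a handful of choices (for example, 24 orderings for four options).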