Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

LOMIA: Label-Only Membership Inference Attacks against Pre-trained Large Vision-Language Models

Authors: Yihao Liu, Xinqi Lyu, Dong Wang, Yanjie Li, Bin Xiao

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive evaluations conducted on two datasets and three open-source pre-trained VLLMs demonstrate that LOMIA performs comparably to existing logits-based attacks across a range of evaluation metrics. We also show the effectiveness of our methods on the closed-source model GPT-4o, which achieved an AUC of 0.669 when evaluating the image-text feature attack method (ITFA). In this section, we conduct MIAs across three target models using various baselines, and our own methods: TTFA, ITFA, and DUFA. The evaluation setup is detailed in Section 5.1. Results for TTFA, ITFA, and DUFA are presented in Section 5.2. Additionally, an ablation study is included in Section 5.3.
Researcher Affiliation	Academia	Yihao Liu, Xinqi Lyu, Dong Wang, Yanjie Li, Bin Xiao. Department of Computing, The Hong Kong Polytechnic University.
Pseudocode	No	The paper describes the methods (TTFA, ITFA, DUFA) using 'Regression Stage' and 'Inference Stage' sections with textual descriptions and mathematical formulas, but it does not present any explicitly labeled 'Pseudocode' or 'Algorithm' blocks in a structured, code-like format.
Open Source Code	No	We will consider releasing the code upon acceptance.
Open Datasets	Yes	Pre-trained VLLMs such as LLaVA-1.5, MiniGPT-4, and LLaMA-Adapter V2 use images from the LAION [41], Conceptual Captions 3M [6], Conceptual 12M [6], and SBU Captions [37] datasets for pre-training [27]. Following Li et al. [27], we randomly sample a subset from the intersection of the datasets used by these three pre-trained VLLMs to serve as the member data. We then use the captions of the member data as input to query the stable-diffusion-v1-5 [40] to generate images that serve as non-member data. To ensure the validity of our MIA on VLLMs, we have 600 images in LOMIA/LAION (300 members and 300 non-members). LOMIA/CC. MS COCO [29] is also a popular dataset used in the pre-training process of the target models, so we randomly select some images in this dataset as member data. We use a similar approach to generate non-member data with stable-diffusion-v1-5 [40]. We also have 600 images in LOMIA/CC (300 members and 300 non-members).
Dataset Splits	No	The paper states the total number of member and non-member images in the LOMIA/LAION and LOMIA/CC datasets (e.g., '600 images in LOMIA/LAION (300 members and 300 non-members)'). However, it does not explicitly describe how these datasets are partitioned into training, validation, and test sets for the purpose of training the regression model within the LOMIA framework and then evaluating its attack performance on unseen data from these datasets.
Hardware Specification	Yes	The attack implementation is conducted on 4 NVIDIA 3090 GPUs.
Software Dependencies	No	The paper mentions using specific models like 'all-MiniLM-L6-v2', 'CLIP', and 'stable-diffusion-v1-5' for various tasks, but it does not provide specific version numbers for these models or other core software dependencies (e.g., Python, PyTorch, CUDA, or other libraries) required for replication.
Experiment Setup	Yes	To mitigate the impact of varying text lengths, we fix the maximum token length at 32. For computing text-text similarity, we use the all-MiniLM-L6-v2 model [21]... Image-text similarity is calculated using CLIP [39]... For Query attack, we set the number of queries to 5 and the temperature to 0.5, which performs the best. Ablation studies also investigate different temperature settings (Figure 2b) and max token length (Figure 2a).
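
The settings quoted in the Experiment Setup row, together with the balanced 300/300 member and non-member sets from the Open Datasets row, can be collected into a small sketch. This is an illustration only, not the authors' code (the table above notes that no code has been released): the class name `AttackSetup`, its field names, and the `auc` helper are hypothetical, and the pairwise AUC below is a generic textbook formulation that may differ from the paper's evaluation script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackSetup:
    """Values quoted in the table above; all field names are hypothetical."""
    max_token_length: int = 32                      # fixed to limit text-length effects
    text_similarity_model: str = "all-MiniLM-L6-v2"  # text-text similarity
    image_text_similarity_model: str = "CLIP"       # image-text similarity
    num_queries: int = 5                            # reported for the Query attack
    temperature: float = 0.5                        # reported for the Query attack
    members_per_dataset: int = 300                  # LOMIA/LAION and LOMIA/CC
    non_members_per_dataset: int = 300


def auc(member_scores, non_member_scores):
    """AUC as the probability that a random member outscores a random
    non-member; ties count as half a win."""
    if not member_scores or not non_member_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


if __name__ == "__main__":
    setup = AttackSetup()
    print(setup)
    # Toy membership scores (made up for illustration, not results from the paper).
    print(auc([0.9, 0.6, 0.4], [0.5, 0.3, 0.2]))
```

The quadratic pairwise loop is kept for readability; with 300 members and 300 non-members per dataset it amounts to 90 000 comparisons, which is negligible.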