Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs

Authors: Yunqi Hong, Sohyun An, Andrew Bai, Neil Lin, Cho-Jui Hsieh

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate our approach on multiple fine-grained classification datasets. It consistently outperforms other unsupervised baselines, demonstrating the effectiveness of our self-supervised optimization framework. Notably, AutoSEP in average improves 13% over standard zero-shot classification and 3% over the best-performing baselines.
Researcher Affiliation	Academia	Yunqi Hong1, Sohyun An1, Andrew Bai1, Neil Y. C. Lin2, Cho-Jui Hsieh1 1Computer Science Department, University of California, Los Angeles 2Mechanical and Aerospace Engineering Department, University of California, Los Angeles EMAIL, EMAIL
Pseudocode	Yes	Algorithm 1 AutoSEP: Automatic Self-Enhancing Prompt Learning
Open Source Code	Yes	Code is available at https://github.com/yq-hong/AutoSEP.
Open Datasets	Yes	We conduct experiments on four fine-grained image classification datasets, including CUB-200-2011 [29] (bird classification), iNaturalist 2021 [28] (various wild species), Stanford Dogs [6], and VegFru [25].
Dataset Splits	No	The paper mentions evaluating on 'datasets that are balanced in class distribution' and constructing 'subsets of categories' for tasks, but does not provide specific training/test/validation split percentages, sample counts, or references to predefined splits with version numbers/citations for their specific experimental setup.
Hardware Specification	Yes	We performed the experiments on local servers with 64 CPU cores and 4 Nvidia A6000 GPUs.
Software Dependencies	No	The paper mentions using state-of-the-art MLLMs like Gemini 1.5 Flash [27], GPT-4o [14], and Qwen2-VL-72B-Instruct [30], but does not specify programming language versions (e.g., Python) or library versions (e.g., PyTorch, TensorFlow, CUDA) used in their implementation of AutoSEP.
Experiment Setup	Yes	In our implementation, we typically set b = 4, l = 5, and k = 2, resulting in an approximate query complexity of O(60n) per iteration.