Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Unlabeled Data Improves Fine-Grained Image Zero-shot Classification with Multimodal LLMs
Authors: Yunqi Hong, Sohyun An, Andrew Bai, Neil Lin, Cho-Jui Hsieh
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on multiple fine-grained classification datasets. It consistently outperforms other unsupervised baselines, demonstrating the effectiveness of our self-supervised optimization framework. Notably, AutoSEP in average improves 13% over standard zero-shot classification and 3% over the best-performing baselines. |
| Researcher Affiliation | Academia | Yunqi Hong¹, Sohyun An¹, Andrew Bai¹, Neil Y. C. Lin², Cho-Jui Hsieh¹; ¹Computer Science Department, University of California, Los Angeles; ²Mechanical and Aerospace Engineering Department, University of California, Los Angeles; EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1 AutoSEP: Automatic Self-Enhancing Prompt Learning |
| Open Source Code | Yes | Code is available at https://github.com/yq-hong/AutoSEP. |
| Open Datasets | Yes | We conduct experiments on four fine-grained image classification datasets, including CUB-200-2011 [29] (bird classification), iNaturalist 2021 [28] (various wild species), Stanford Dogs [6], and VegFru [25]. |
| Dataset Splits | No | The paper mentions evaluating on 'datasets that are balanced in class distribution' and constructing 'subsets of categories' for tasks, but does not provide specific training/test/validation split percentages, sample counts, or references to predefined splits with version numbers/citations for their specific experimental setup. |
| Hardware Specification | Yes | We performed the experiments on local servers with 64 CPU cores and 4 Nvidia A6000 GPUs. |
| Software Dependencies | No | The paper mentions using state-of-the-art MLLMs like Gemini 1.5 Flash [27], GPT-4o [14], and Qwen2-VL-72B-Instruct [30], but does not specify programming language versions (e.g., Python) or library versions (e.g., PyTorch, TensorFlow, CUDA) used in their implementation of AutoSEP. |
| Experiment Setup | Yes | In our implementation, we typically set b = 4, l = 5, and k = 2, resulting in an approximate query complexity of O(60n) per iteration. |