Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
SharpZO: Hybrid Sharpness-Aware Vision Language Model Prompt Tuning via Forward-Only Passes
Authors: Yifan Yang, Zhen Zhang, Rupak Vignesh Swaminathan, Jing Liu, Nathan Susanj, Zheng Zhang
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Detailed theoretical analysis and extensive experiments on CLIP models demonstrate that SharpZO significantly improves accuracy and convergence speed, achieving up to 7% average gain over state-of-the-art forward-only methods. In this section, we present experimental results to evaluate the performance of the proposed SharpZO method across a variety of downstream tasks using CLIP models with different architectures. Specifically, we compare the proposed SharpZO method with zero-shot (ZS) inference and other BP-free baselines like BBT [49], BlackVIP [31], and ZIP [32] (detailed descriptions of the tasks and baseline methods can be found in Appendix B). |
| Researcher Affiliation | Collaboration | Yifan Yang¹, Zhen Zhang¹, Rupak Vignesh Swaminathan², Jing Liu², Nathan Susanj², Zheng Zhang¹; ¹University of California, Santa Barbara; ²Amazon AGI |
| Pseudocode | Yes | Algorithm 1 SharpZO: Hybrid Sharpness-Aware Zeroth-Order Optimization |
| Open Source Code | Yes | Answer: [Yes] Justification: We will release the code; the anonymous code is provided in the supplemental material with documentation. |
| Open Datasets | Yes | Datasets: Following the experimental setup of prior VLMs fine-tuning works [43, 32], we evaluate SharpZO on 11 diverse image classification benchmarks under a few-shot learning scenario. These datasets cover a broad range of tasks: generic object recognition with ImageNet [9] and Caltech101 [42], fine-grained image classification with Oxford Pets [33], Stanford Cars [23], Flowers102 [30], Food101 [4], and FGVCAircraft [27], satellite image classification with EuroSAT [20], texture recognition with DTD [6], scene classification with SUN397 [44], and action recognition with UCF101 [36]. |
| Dataset Splits | Yes | All experiments use a 16-shot setup unless otherwise specified. All hyper-parameter searches are performed on a 5-shot validation set extracted from the official validation set or split from the training set (e.g. ImageNet). |
| Hardware Specification | Yes | The time is recorded in minutes. Tested on single Nvidia A100-40G GPU. |
| Software Dependencies | No | The paper mentions general software components like CLIP, ResNet, ViT, and Transformers but does not specify their version numbers or any other software dependencies with specific versions. |
| Experiment Setup | Yes | Training Detail: For the VLM model, we utilize CLIP [34] with both ResNet [19] and ViT [10] backbones as the visual encoder, and Transformers [41] as the text encoder. The CLIP weights are initialized from the official pretrained checkpoints and remain frozen during training. The prompt generator uses an initial prompt of length 4 and a hidden dimension d = 512. Parameters in w are initialized from a Gaussian distribution N(0, 0.02). Detailed hyper-parameter setups for the SharpZO method on various tasks can be found in Table 8 in Appendix C.2. |
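
The Dataset Splits row describes a 16-shot training setup with a 5-shot validation set. As a rough illustration of what such a split looks like, the following is a minimal sketch and not the authors' code: the helper name `sample_few_shot`, the seeds, and the toy labels are all hypothetical, and it only covers the case where the validation set is carved out of the training pool rather than taken from an official validation set.

```python
import random
from collections import defaultdict


def sample_few_shot(labels, shots, seed=0, exclude=()):
    """Pick `shots` example indices per class (hypothetical helper).

    labels  : sequence of class labels, one per example
    shots   : number of examples to draw for every class
    exclude : indices that must not be picked (e.g. another split)
    """
    rng = random.Random(seed)
    excluded = set(exclude)

    # Group the still-available example indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        if idx not in excluded:
            by_class[label].append(idx)

    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        if len(pool) < shots:
            raise ValueError(
                f"class {label!r} has only {len(pool)} examples, need {shots}"
            )
        # Sample without replacement inside each class.
        chosen.extend(rng.sample(pool, shots))
    return sorted(chosen)


# Toy data (assumption): 3 classes with 30 examples each.
labels = [c for c in range(3) for _ in range(30)]

# 16 training shots per class, then 5 disjoint validation shots per class.
train_idx = sample_few_shot(labels, shots=16, seed=0)
val_idx = sample_few_shot(labels, shots=5, seed=1, exclude=train_idx)
```

Passing the training indices as `exclude` keeps the two splits disjoint, which matters when the validation shots come from the same pool as the training shots.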