Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models
Authors: Silin Cheng, Kai Han
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on few-shot and domain generalization benchmarks show that VaMP achieves state-of-the-art performance, highlighting the benefits of modeling both uncertainty and task structure in our method. Project page: https://visual-ai.github.io/vamp |
| Researcher Affiliation | Academia | Silin Cheng, Kai Han; Visual AI Lab, The University of Hong Kong |
| Pseudocode | No | The paper describes the methodology using textual explanations, mathematical equations, and diagrams (e.g., Figure 1), but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | (2) Code: Code will be available after paper got accepted. |
| Open Datasets | Yes | (1) Data: All the datasets we used in this paper are publicly available online, and all the readers are free to download them. We list the statistics of all the used datasets in the supplementary material. ... This evaluation is conducted across 11 diverse classification datasets: ImageNet [72], Caltech101 [73], Oxford Pets [74], Stanford Cars [75], Flowers102 [76], Food101 [77], FGVCAircraft [78], SUN397 [79], UCF101 [80], DTD [81], and EuroSAT [82]. |
| Dataset Splits | Yes | In this setting, dataset classes are split into base and novel classes. ... All conducted under a 16-shot setting, where each category has only 16 training examples. ... Details of 14 datasets are shown in Table A1. Table A1: Summary of the 14 datasets. ... Caltech101 [73] 100 4,128 1,649 2,465 Recognition of generic objects |
| Hardware Specification | Yes | All experiments are conducted on a single NVIDIA V100 GPU. |
| Software Dependencies | No | All models are trained using the AdamW optimizer with a learning rate of 0.001 and a weight decay of 0.01. The batch size is set to 32 for ImageNet and 4 for all other datasets. We apply automatic mixed-precision training throughout to improve efficiency. For base-to-novel generalization on ImageNet, we train for 5 epochs; for other datasets we train for 10 epochs. For cross-dataset and domain generalization tasks, we train on ImageNet for a single epoch. Few-shot learning tasks use 5 training epochs on ImageNet and 50 epochs on target datasets. All reported results are averaged over three independent runs. All prompts and representation tokens are initialized from a zero-mean Gaussian distribution with a standard deviation of 0.02. For EuroSAT, we follow MMRL [8] and set the representation token dimension d_r = 2048; for all other datasets, we use d_r = 512. The fusion parameter α in MMRL-style classifiers is fixed to 0.7. The average accuracy is reported over three independent runs. For variational modeling, we use a two-layer MLP with GELU activation to parameterize both the posterior network ϕ and the prior network ψ, outputting mean and log-variance vectors per layer. The latent variables z are sampled using the reparameterization trick, and we perform S = 10 Monte Carlo samples at inference time. Class prototypes o_y are computed offline at the start of training. |
| Experiment Setup | Yes | All models are trained using the AdamW optimizer with a learning rate of 0.001 and a weight decay of 0.01. The batch size is set to 32 for ImageNet and 4 for all other datasets. We apply automatic mixed-precision training throughout to improve efficiency. For base-to-novel generalization on ImageNet, we train for 5 epochs; for other datasets we train for 10 epochs. For cross-dataset and domain generalization tasks, we train on ImageNet for a single epoch. Few-shot learning tasks use 5 training epochs on ImageNet and 50 epochs on target datasets. All reported results are averaged over three independent runs. All prompts and representation tokens are initialized from a zero-mean Gaussian distribution with a standard deviation of 0.02. For EuroSAT, we follow MMRL [8] and set the representation token dimension d_r = 2048; for all other datasets, we use d_r = 512. The fusion parameter α in MMRL-style classifiers is fixed to 0.7. The average accuracy is reported over three independent runs. For variational modeling, we use a two-layer MLP with GELU activation to parameterize both the posterior network ϕ and the prior network ψ, outputting mean and log-variance vectors per layer. The latent variables z are sampled using the reparameterization trick, and we perform S = 10 Monte Carlo samples at inference time. Class prototypes o_y are computed offline at the start of training. |
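
The settings quoted in the Experiment Setup row can be collected into a small configuration object, and the reparameterized sampling with Monte Carlo averaging can be sketched in a few lines. The following is a minimal illustration in plain Python, not the authors' implementation (the code was not released at the time of classification). The names `ReportedSetup`, `reparameterize`, `monte_carlo_average`, and the `predict` callback are hypothetical, and the MLP that would produce the mean and log-variance vectors is assumed to exist elsewhere.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters quoted in the table; the field names are hypothetical."""
    optimizer: str = "AdamW"
    learning_rate: float = 1e-3
    weight_decay: float = 0.01
    init_std: float = 0.02      # std of the zero-mean Gaussian initialization
    fusion_alpha: float = 0.7   # fusion parameter in MMRL-style classifiers
    mc_samples: int = 10        # S, Monte Carlo samples at inference time

    def batch_size(self, dataset: str) -> int:
        # 32 for ImageNet, 4 for all other datasets
        return 32 if dataset == "ImageNet" else 4

    def token_dim(self, dataset: str) -> int:
        # d_r = 2048 for EuroSAT, 512 for all other datasets
        return 2048 if dataset == "EuroSAT" else 512


def reparameterize(mean, log_var, rng):
    """Draw z = mean + sigma * eps, with eps ~ N(0, 1) and sigma = exp(0.5 * log_var)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mean, log_var)]


def monte_carlo_average(mean, log_var, predict, num_samples, rng):
    """Average predict(z) over num_samples reparameterized draws of z.

    `predict` is a stand-in for whatever maps a latent sample to scores.
    """
    total = None
    for _ in range(num_samples):
        out = predict(reparameterize(mean, log_var, rng))
        total = list(out) if total is None else [t + o for t, o in zip(total, out)]
    return [t / num_samples for t in total]
```

A call such as `monte_carlo_average(mean, log_var, predict, ReportedSetup().mc_samples, random.Random(0))` would average predictions over S = 10 draws. Writing the sample as `mean + sigma * eps` keeps the randomness in `eps`, which is what lets gradients flow to the mean and log-variance in an autodiff framework.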