Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You

Authors: Fabian Gröger, Shuo Wen, Huyen Le, Maria Brbić

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate model performance on zero-shot classification and cross-modal retrieval tasks. For zero-shot classification, we consider 22 datasets from the CLIP benchmark [7]. For cross-modal retrieval, we consider the Flickr30 and MS COCO test splits to evaluate both text-to-image and image-to-text performance.
Researcher Affiliation | Academia | Fabian Gröger (1,2,3), Shuo Wen (1), Huyen Le (1), Maria Brbić (1); (1) EPFL, (2) University of Basel, (3) HSLU
Pseudocode | Yes | Listing 1: PyTorch reference implementation of the STRUCTURE regularizer RS(X, A)
Open Source Code | No | Yes, code will be made public once the paper is de-anonymized.
Open Datasets | Yes | To align models, we use the MS COCO train split consisting of 80,000 paired samples [33]. We evaluate model performance on zero-shot classification and cross-modal retrieval tasks. For zero-shot classification, we consider 22 datasets from the CLIP benchmark [7]. For cross-modal retrieval, we consider the Flickr30 and MS COCO test splits to evaluate both text-to-image and image-to-text performance.
Dataset Splits | Yes | To align models, we use the MS COCO train split consisting of 80,000 paired samples [33]. ... Table 4: List of benchmarks we used for both zero-shot classification and retrieval evaluation. Columns: Task, Dataset, Number of Classes, Train size, Test size. ... Food101 [45]: 101 classes, train size 75,750, test size 25,250.
Hardware Specification | Yes | Our experiments used a cluster of 8 NVIDIA GeForce RTX 3090 GPUs, but each individual training run required only a single GPU and less than 4 GB of VRAM and took at most 2 hours.
Software Dependencies | No | The paper does not explicitly state specific version numbers for software dependencies like PyTorch, only mentioning its use implicitly in the pseudocode listing.
Experiment Setup | Yes | Table 3: List of default hyperparameters used throughout the paper. Alignment training: Epochs 1,000; Batch size 4,096; Learning rate scheduler Cosine; Auto learning rate finder [40]; Gradient clipping 1.0; Early stopping epochs 200; Optimizer AdamW; Weight decay 0.0001. Alignment objective: Temperature τ 0.05; RS levels L 1; λ 10.0; λ warmup Linear; λ warmup steps 1,000. Alignment layer: Output dimension 512.
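
For readers who want to reuse these defaults, the values quoted in the Experiment Setup row can be gathered into a small configuration object. The sketch below is illustrative only: the class name `AlignmentConfig`, its field names, and the helper `lambda_at_step` are hypothetical and do not come from the paper. The table states only that the λ warmup is "Linear" over 1,000 steps; ramping from 0 up to λ = 10.0 and then holding is an assumption made here. The learning rate is omitted because the table lists an automatic learning rate finder [40] rather than a fixed value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentConfig:
    """Hypothetical container for the defaults quoted from Table 3."""

    # Alignment training
    epochs: int = 1000
    batch_size: int = 4096
    lr_scheduler: str = "cosine"
    gradient_clipping: float = 1.0
    early_stopping_epochs: int = 200
    optimizer: str = "AdamW"
    weight_decay: float = 1e-4
    # Alignment objective
    temperature: float = 0.05      # tau
    rs_levels: int = 1             # L, number of regularizer levels
    lam: float = 10.0              # lambda, weight of the regularizer
    lam_warmup: str = "linear"
    lam_warmup_steps: int = 1000
    # Alignment layer
    output_dim: int = 512


def lambda_at_step(step: int, cfg: AlignmentConfig) -> float:
    """Regularizer weight at a given optimizer step.

    Assumption: linear ramp from 0 to cfg.lam over cfg.lam_warmup_steps,
    constant afterwards. The source only says the warmup is "Linear".
    """
    if cfg.lam_warmup_steps <= 0:
        return cfg.lam
    fraction = min(max(step, 0) / cfg.lam_warmup_steps, 1.0)
    return cfg.lam * fraction


if __name__ == "__main__":
    cfg = AlignmentConfig()
    for step in (0, 250, 500, 1000, 2000):
        print(step, lambda_at_step(step, cfg))
```

Keeping the values in a frozen dataclass makes accidental mutation during a run an error; whether the authors organize their configuration this way is not stated on this page.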