Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
With Limited Data for Multimodal Alignment, Let the STRUCTURE Guide You
Authors: Fabian Gröger, Shuo Wen, Huyen Le, Maria Brbic
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate model performance on zero-shot classification and cross-modal retrieval tasks. For zero-shot classification, we consider 22 datasets from the CLIP benchmark [7]. For cross-modal retrieval, we consider the Flickr30 and MS COCO test splits to evaluate both text-to-image and image-to-text performance. |
| Researcher Affiliation | Academia | Fabian Gröger1,2,3, Shuo Wen1, Huyen Le1, Maria Brbić1; 1EPFL, 2University of Basel, 3HSLU |
| Pseudocode | Yes | Listing 1: PyTorch reference implementation of the STRUCTURE regularizer RS(X, A) |
| Open Source Code | No | Yes, code will be made public once the paper is de-anonymized. |
| Open Datasets | Yes | To align models, we use the MS COCO train split consisting of 80,000 paired samples [33]. We evaluate model performance on zero-shot classification and cross-modal retrieval tasks. For zero-shot classification, we consider 22 datasets from the CLIP benchmark [7]. For cross-modal retrieval, we consider the Flickr30 and MS COCO test splits to evaluate both text-to-image and image-to-text performance. |
| Dataset Splits | Yes | To align models, we use the MS COCO train split consisting of 80,000 paired samples [33]. ... Table 4: List of benchmarks we used for both zero-shot classification and retrieval evaluation. Columns: Task, Dataset, Number of Classes, Train size, Test size ... Food101 [45]: 101 classes, train size 75,750, test size 25,250 |
| Hardware Specification | Yes | Our experiments used a cluster of 8 NVIDIA GeForce RTX 3090 GPUs, but each individual training run required only a single GPU and less than 4GB of VRAM and took at most 2 hours. |
| Software Dependencies | No | The paper does not explicitly state specific version numbers for software dependencies like PyTorch, only mentioning its use implicitly in the pseudocode listing. |
| Experiment Setup | Yes | Table 3: List of default hyperparameters used throughout the paper. Alignment training: Epochs 1,000; Batch size 4,096; Learning rate scheduler Cosine; Auto learning rate finder [40]; Gradient clipping 1.0; Early stopping epochs 200; Optimizer AdamW; Weight decay 0.0001. Alignment objective: Temperature τ 0.05; RS levels L 1; λ 10.0; λ warmup Linear; λ warmup steps 1,000. Alignment layer: Output dimension 512. |
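
For readers who want to reuse the defaults quoted in the Experiment Setup row, the values could be collected into a small configuration object. The sketch below is illustrative only and is not the authors' code: the class name `AlignmentConfig`, all field names, and the helper `warmup_lambda` are hypothetical. The "Auto learning rate finder [40]" entry is left out because the quoted text gives no numeric value for it. The warmup helper assumes the linear λ warmup ramps from 0 to the target value over the listed number of steps; the source states only "Linear" and "1,000" steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlignmentConfig:
    """Defaults quoted from Table 3; all field names are hypothetical."""

    # Alignment training
    epochs: int = 1_000
    batch_size: int = 4_096
    lr_scheduler: str = "cosine"
    gradient_clipping: float = 1.0
    early_stopping_epochs: int = 200
    optimizer: str = "adamw"
    weight_decay: float = 1e-4
    # Alignment objective
    temperature: float = 0.05
    regularizer_levels: int = 1
    regularizer_lambda: float = 10.0
    lambda_warmup_steps: int = 1_000
    # Alignment layer
    output_dim: int = 512


def warmup_lambda(step: int, cfg: AlignmentConfig) -> float:
    """Linearly ramp the regularizer weight up to its target value.

    Assumption: the ramp starts at 0 on step 0 and reaches the target
    at `lambda_warmup_steps`; the source only says the warmup is linear.
    """
    if cfg.lambda_warmup_steps <= 0:
        return cfg.regularizer_lambda
    fraction = min(max(step, 0) / cfg.lambda_warmup_steps, 1.0)
    return cfg.regularizer_lambda * fraction


if __name__ == "__main__":
    cfg = AlignmentConfig()
    for step in (0, 250, 500, 1_000, 2_000):
        print(step, warmup_lambda(step, cfg))
```

Freezing the dataclass keeps the defaults from being mutated accidentally during a run; a variant experiment could build a modified copy with `dataclasses.replace`.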