Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
AmorLIP: Efficient Language-Image Pretraining via Amortization
Authors: Haotian Sun, Yitong Li, Yuchen Zhuang, Niao He, Hanjun Dai, Bo Dai
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across 38 downstream tasks demonstrate the superior zero-shot classification and retrieval capabilities of AMORLIP, consistently outperforming standard CLIP baselines with substantial relative improvements of up to 12.24%. |
| Researcher Affiliation | Collaboration | Georgia Institute of Technology; Swiss Federal Institute of Technology; Precur.ai; EMAIL, EMAIL |
| Pseudocode | Yes | Algorithm 1: AMORLIP Framework |
| Open Source Code | Yes | Our implementation is available at https://github.com/haotiansun14/AmorLIP. |
| Open Datasets | Yes | We pretrain models at two scales: a medium-scale experiment using ResNet-50 [31] trained on Conceptual Captions 3M (CC-3M; [57]) with a batch size of 1024 for 30 epochs, and a large-scale experiment using ViT-B/32 trained on Conceptual Captions 12M (CC-12M; [10]) with a batch size of 2048 for 33 epochs. |
| Dataset Splits | Yes | We evaluate AMORLIP and baseline methods using the DataComp benchmark [23], comprising 38 widely used text-image tasks. Specifically, we report top-1 zero-shot classification accuracy on ImageNet (IN-1K; [55]) and six of its distribution-shifted variants: ImageNet Sketch (IN-Sk; [65]), ImageNet-V2 (IN-V2; [53]), ImageNet-A (IN-A; [33]), ImageNet-O (IN-O; [33]), ImageNet-R (IN-R; [32]), and ObjectNet (ObjN; [4]). Additionally, we evaluate retrieval performance via recall@1 on Flickr30k (Flickr; [69]) and MSCOCO (COCO; [14]). |
| Hardware Specification | Yes | All experiments are conducted using NVIDIA H100 GPUs with 80GB VRAM. |
| Software Dependencies | No | We use the OpenCLIP [36] codebase and original implementations for these models. For training the encoders, we employ the AdamW optimizer [40] with a learning rate of $1 \times 10^{-3}$ for the medium-scale setting and $4 \times 10^{-4}$ for the large-scale setting. For updating the amortization network, we use the Adam optimizer [40] universally set at a learning rate of $1 \times 10^{-3}$. |
| Experiment Setup | Yes | We pretrain models at two scales: a medium-scale experiment using ResNet-50 [31] trained on Conceptual Captions 3M (CC-3M; [57]) with a batch size of 1024 for 30 epochs, and a large-scale experiment using ViT-B/32 trained on Conceptual Captions 12M (CC-12M; [10]) with a batch size of 2048 for 33 epochs. [...] In AMORLIP, we implement $\lambda_{\theta_l}$ using a three-layer MLP for each modality $l \in \{I, T\}$... we choose $f_d = 0.5$ for the medium-scale setting and $f_d = 1.0$ for the large-scale setting. For amortization hyperparameters detailed in Algorithm 1, we set $T_\lambda = 3$ and $T_{\text{target}} = 2$ for both training scales, while $T_{\text{online}}$ is set to 8 for medium-scale and 1 for large-scale experiments. Regarding techniques described in Section 3.3, the EMA factor $\alpha$ is set to 0.999 for medium-scale and 0.92 for large-scale training. The parameter $\beta_T$ is universally fixed at 0.8. [...] For training the encoders, we employ the AdamW optimizer [40] with a learning rate of $1 \times 10^{-3}$ for the medium-scale setting and $4 \times 10^{-4}$ for the large-scale setting. For updating the amortization network, we use the Adam optimizer [40] universally set at a learning rate of $1 \times 10^{-3}$. |
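
The two training scales quoted in the Experiment Setup row can be collected into a small configuration record for side-by-side comparison. The following is a minimal sketch, not the authors' configuration format: the class name `AmorLIPSetup`, its field names, and the helper `differing_fields` are hypothetical, and the learning rates assume the excerpt's values read as 1e-3 and 4e-4.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AmorLIPSetup:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    backbone: str
    dataset: str
    batch_size: int
    epochs: int
    encoder_lr: float       # AdamW learning rate for the encoders
    amortization_lr: float  # Adam learning rate for the amortization network
    f_d: float              # f_d as quoted in the excerpt
    t_lambda: int           # T_lambda from Algorithm 1
    t_target: int           # T_target from Algorithm 1
    t_online: int           # T_online from Algorithm 1
    ema_alpha: float        # EMA factor alpha (Section 3.3)
    beta_t: float           # beta_T, fixed across scales


MEDIUM = AmorLIPSetup(
    backbone="ResNet-50", dataset="CC-3M", batch_size=1024, epochs=30,
    encoder_lr=1e-3, amortization_lr=1e-3, f_d=0.5,
    t_lambda=3, t_target=2, t_online=8, ema_alpha=0.999, beta_t=0.8,
)

LARGE = AmorLIPSetup(
    backbone="ViT-B/32", dataset="CC-12M", batch_size=2048, epochs=33,
    encoder_lr=4e-4, amortization_lr=1e-3, f_d=1.0,
    t_lambda=3, t_target=2, t_online=1, ema_alpha=0.92, beta_t=0.8,
)


def differing_fields(a: AmorLIPSetup, b: AmorLIPSetup) -> dict:
    """Return {field: (value_in_a, value_in_b)} for fields whose values differ."""
    da, db = asdict(a), asdict(b)
    return {name: (da[name], db[name]) for name in da if da[name] != db[name]}


if __name__ == "__main__":
    for name, (medium, large) in differing_fields(MEDIUM, LARGE).items():
        print(f"{name}: medium={medium}, large={large}")
```

Running the script lists only the settings that change between the two scales; the amortization learning rate, $T_\lambda$, $T_{\text{target}}$, and $\beta_T$ are shared and therefore omitted from the output.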