Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Aligning by Misaligning: Boundary-aware Curriculum Learning for Multimodal Alignment

Authors: Hua Ye, Hang Ding, Siyuan Chen, Yiyang Jiang, Changyuan Zhang, Xuan Zhang

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The paper includes extensive experimental validation in Section 5, titled "Experiments," with subsections such as "5.1 Experimental Setup," "5.2 Main Results," "5.3 Ablation Study," "5.4 Hard-Negative Mining Study," and "5.5 Cross-modal Generalisation." It evaluates performance on multiple large-scale datasets (LAION-400M, WebVid-10M, VAST-27M, WavText5K) using various metrics (R@1, R@5, R@10, mAP, nDCG, Accuracy, F1, Recall, MRR) and compares against several baselines. The paper also conducts ablation studies to quantify the individual contributions of its components.
Researcher Affiliation | Collaboration | The authors are affiliated with a mix of academic institutions (Nanjing University, Shanghai Jiao Tong University, University of Bristol, The Hong Kong Polytechnic University, The University of Hong Kong, Carnegie Mellon University) and an industry entity (Airon Technology CO., LTD).
Pseudocode | Yes | The paper includes a clearly labeled algorithm block titled "Algorithm 1 BACL: Boundary-aware Curriculum Learning for Multimodal Alignment" in Appendix B, as referenced in the main text: "Algorithm 1 in Appendix B outlines our BACL training pipeline."
Open Source Code | No | In the NeurIPS Paper Checklist, the authors state: "Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [No] Justification: Due to institutional restrictions and proprietary considerations, the data and code used in this study are not publicly available at this time."
Open Datasets | Yes | The paper uses and cites several well-known publicly available datasets: "(i) the LAION-400M image-text corpus [Schuhmann et al., 2021]; (ii) the WebVid-10M video-text collection [Bain et al., 2021]; (iii) the VAST-27M tri-modal dataset of video, audio, and subtitles [Chen et al., 2023]; and (iv) the WavText5K audio-text benchmark [Deshmukh et al., 2022]." Furthermore, it states in Appendix C: "All datasets are released under permissive licenses (e.g. CC-BY-4.0); we strictly follow the original creators' data-usage terms."
Dataset Splits | Yes | The paper explicitly describes dataset splits in Appendix C, "Dataset Details": For LAION-400M, "We keep the official training partition (398 M pairs) for unsupervised pre-training and randomly sample 50 k pairs for validation. Retrieval evaluation follows the standard 30 k image query split...". For WebVid-10M, "We adopt the pre-train split (10.1 M) for curriculum mining and the canonical val split (40 k) for retrieval...". For VAST-27M, "We use the official train/val/test splits (26 M / 0.5 M / 0.5 M)." For WavText5K, "We use the public train/val/test splits (3 742 / 640 / 741)."
Hardware Specification | Yes | The paper specifies hardware details in Appendix E, "Implementation Details": "All models are pre-trained for ten epochs on each dataset with a global batch size of 16 384 (512 per GPU, 32 A100)." and in Table 8, "Consolidated efficiency metrics on LAION-400M (batch=512). Iteration rate measured on 8 A100-40GB; memory on a single A100-40GB." This explicitly mentions NVIDIA A100 GPUs with 40GB memory.
Software Dependencies | No | The paper mentions specific model architectures like "CLIP ViT-B/16 (visual), GELU-RoBERTa (text), and CLAP PANN14 (audio) backbones" and the "AdamW" optimizer, but it does not provide specific version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages (e.g., Python, CUDA).
Experiment Setup | Yes | The paper provides extensive details on the experimental setup in Appendix E, "Implementation Details." This includes model architecture parameters (e.g., "4-layer cross-modal Transformer with hidden size 512"), BNS policy network details ("two-layer MLP (512-128-1) with SiLU activation. Gumbel-Softmax temperature τ is initialised at 0.7 and linearly annealed to 0.1. Logistic schedule parameters are set to α_early=0.3, α_late=0.5, γ=1.5, and η_0 equal to 40% of the total pre-training epochs."), CLA parameters ("top 15% token pairs... gain coefficient β is fixed at 2.0 and λ_local at 0.3."), and training hyperparameters ("ten epochs on each dataset with a global batch size of 16 384 (512 per GPU, 32 A100). AdamW weight decay is set to 1e-2 and learning rate to 2e-4 with cosine decay."). It also notes that fine-tuning hyperparameters for VQA v2 and NLVR2 are given in Appendix F.
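
Since no code is released, the sketch below gathers the hyperparameters quoted in the Experiment Setup row into one place and checks the batch-size arithmetic (512 per GPU on 32 GPUs). It is an illustration only, not the authors' implementation: the class name, field names, and the `annealed_tau` helper are hypothetical, and the assumption that the Gumbel-Softmax temperature is annealed linearly over all training steps is ours, since the quoted text does not say over which span the annealing runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Values quoted from Appendix E; the field names are illustrative."""
    per_gpu_batch: int = 512
    num_gpus: int = 32
    epochs: int = 10
    learning_rate: float = 2e-4
    weight_decay: float = 1e-2
    tau_start: float = 0.7  # initial Gumbel-Softmax temperature
    tau_end: float = 0.1    # final Gumbel-Softmax temperature

    @property
    def global_batch(self) -> int:
        # Global batch size is the per-GPU batch times the number of GPUs.
        return self.per_gpu_batch * self.num_gpus


def annealed_tau(step: int, total_steps: int, cfg: ReportedSetup) -> float:
    """Linearly interpolate tau from tau_start to tau_end.

    Assumption: the schedule spans steps 0 .. total_steps - 1.
    """
    if total_steps <= 1:
        return cfg.tau_end
    # Clamp progress to [0, 1] so out-of-range steps stay within bounds.
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    return cfg.tau_start + frac * (cfg.tau_end - cfg.tau_start)


cfg = ReportedSetup()
print("global batch:", cfg.global_batch)
print("tau at start:", annealed_tau(0, 1000, cfg))
print("tau at end:", annealed_tau(999, 1000, cfg))
```

The computed global batch matches the reported figure of 16 384, which is a quick consistency check on the Hardware Specification and Experiment Setup rows.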