Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

A TRIANGLE Enables Multimodal Alignment Beyond Cosine Similarity

Authors: Giordano Cicchetti, Eleonora Grassucci, Danilo Comminiello

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive evaluation in three-modal tasks such as video-text and audio-text retrieval or audio-video classification, demonstrates that TRIANGLE achieves state-of-the-art results across different datasets improving the performance of cosine-based methods up to 9 points of Recall@1.
Researcher Affiliation | Academia | Giordano Cicchetti, Eleonora Grassucci, Danilo Comminiello Department of Information Engineering, Electronics, and Telecommunications Sapienza University of Rome, Italy {name.surname}@uniroma1.it
Pseudocode | No | The method is described using mathematical formulations and descriptive text, but no explicit pseudocode or algorithm blocks are provided in the paper.
Open Source Code | Yes | Code and checkpoints available at https://github.com/ispamm/TRIANGLE/.
Open Datasets | Yes | We utilize several benchmark datasets for our downstream tasks: MSR-VTT Xu et al. (2016) [...] DiDeMo Hendricks et al. (2017) [...] ActivityNet Caba Heilbron et al. (2015) [...] VATEX Wang et al. (2019) [...] AudioCaps Kim et al. (2019) [...] VGGSound Chen et al. (2020).
Dataset Splits | Yes | Retrieval performance is evaluated every 100 steps on the MSR-VTT test set, and the checkpoint with the best performance is selected. We perform a deeper study on the ability of TRIANGLE to better model the latent space by letting TRIANGLE losses learn from scratch on the MSR-VTT dataset for the multimodal text-to-audio/video (T2AV) and audio/video-to-text (AV2T) tasks. In the training-from-scratch experiments we train from scratch the aforementioned encoders on the MSRVTT train dataset [...] We utilize several benchmark datasets for our downstream tasks: [...] AudioCaps [...] We follow the dataset split protocol proposed by ? for the text-to-audio retrieval task. VGGSound [...] Due to download limitations, we use a subset of 5,000 samples for testing.
Hardware Specification | Yes | TRIANGLE brings negligible increment of the computational time with only 0.0016 seconds to compute the area of three vectors of dimension 2048 against the 0.0001 seconds of the cosine similarity computation with a batch of size 256 on an RTX4080 in inference. We perform both pretraining and training from scratch of the TRIANGLE model using 4 A100 GPUs.
Software Dependencies | No | The paper mentions BERT-B, BEATs, and EVA-CLIP as backbone models/encoders, but does not provide specific software dependencies like programming language versions or library versions (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | We pretrain the TRIANGLE model on top of VAST Chen et al. (2023b) (removing the fusing layers) on a subset of 150k samples randomly selected from the VAST27M dataset Chen et al. (2023b). We employ an initial learning rate of 1e-4 with a linear decay schedule and a batch size of 256. In the training-from-scratch experiments we train from scratch the aforementioned encoders on the MSRVTT train dataset for 4 epochs with an initial learning rate of 1e-4 with a linear decay schedule and a batch size of 64.
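
The Hardware Specification entry compares the cost of computing "the area of three vectors of dimension 2048" with a cosine similarity at batch size 256. The NumPy sketch below illustrates what such a comparison could involve; it is not taken from the paper or its repository. It assumes the triangle's vertices are the tips of the three embedding vectors and uses the standard Gram-determinant area formula. The function names `triangle_area` and `cosine_similarity` and the random inputs are hypothetical, and the paper's exact formulation (for example, any normalization of the embeddings) may differ.

```python
import numpy as np


def triangle_area(a, b, c):
    """Area of the triangle with vertices a, b, c, batched over the last axis.

    Assumption (not confirmed by the scorecard): the vertices are the tips of
    the three embedding vectors.
    """
    u = b - a  # first edge vector
    v = c - a  # second edge vector
    uu = np.sum(u * u, axis=-1)
    vv = np.sum(v * v, axis=-1)
    uv = np.sum(u * v, axis=-1)
    # The Gram determinant is the squared area of the parallelogram spanned
    # by u and v; clip at zero to absorb floating-point rounding.
    gram = np.clip(uu * vv - uv ** 2, 0.0, None)
    return 0.5 * np.sqrt(gram)


def cosine_similarity(a, b):
    """Pairwise (row-by-row) cosine similarity over the last axis."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1)
    return num / den


# Random stand-ins for text, audio, and video embeddings, using the batch
# size (256) and dimension (2048) quoted in the Hardware Specification entry.
rng = np.random.default_rng(0)
batch, dim = 256, 2048
text, audio, video = (rng.standard_normal((batch, dim)) for _ in range(3))

areas = triangle_area(text, audio, video)   # one area per (text, audio, video) triple
cosines = cosine_similarity(text, video)    # one score per (text, video) pair
```

Under this assumption, the area is a single score involving all three modalities at once, whereas cosine similarity compares only two vectors at a time, which is consistent with the slightly higher per-batch cost the entry reports.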