Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Rebalancing Contrastive Alignment with Bottlenecked Semantic Increments in Text-Video Retrieval

Authors: Jian Xiao, Zijie Song, Jialong Hu, Hao Cheng, Zhenzhen Hu, Jia Li, Richang Hong

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on four benchmarks demonstrate that GARE consistently improves alignment accuracy and robustness, validating the effectiveness of gap-aware tension mitigation. Code is available at https://github.com/musicman217/GARE-text-video-retrieval.
Researcher Affiliation | Academia | (1) School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China; (2) School of Big Data and Statistics, Anhui University, Hefei, China
Pseudocode | Yes | Figure 10: Inference pseudocode of GARE. [...] are computed on-the-fly and discarded immediately to ensure constant memory usage during large-scale retrieval. Figure 11: Simplified implementation of ψ's cross-attention operation.
Open Source Code | Yes | Code is available at https://github.com/musicman217/GARE-text-video-retrieval.
Open Datasets | Yes | We evaluate our method on four standard text-video retrieval benchmarks: MSR-VTT [52], DiDeMo [2], MSVD [8], and ActivityNet Captions [27].
Dataset Splits | Yes | MSR-VTT contains 10K videos with 20 captions each; we follow the 1K-A validation split.
Hardware Specification | Yes | All experiments are conducted on 4 to 8 GPUs including RTX 4090, A100 and V100.
Software Dependencies | No | The paper mentions using CLIP (ViT-B/32) as a base dual-encoder and Adam optimizer, but does not provide specific version numbers for crucial software libraries (e.g., PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | We adopt CLIP (ViT-B/32) [38] as the base dual-encoder, equipped with a 4-layer Temporal Transformer [42] following the CLIP vision encoder for video encoding. Following prior works [34, 17, 30, 45], we use 32-word captions and 12 video frames for MSR-VTT and MSVD, and 64-word captions with 64 frames for DiDeMo and ActivityNet Captions due to their longer video durations. We use the Adam optimizer [14] with linear warm-up, as in prior works. The learning rate is set to 1e-7 for CLIP's text and visual encoders, and 1e-4 for all other modules. We set β = 0.07, τ = 0.01, α = 2, and λ = 0.5 for MSR-VTT. All experiments use a batch size of 128. We train the model for 5 epochs on MSR-VTT, MSVD, and DiDeMo, and 10 epochs on ActivityNet Captions.
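
The reported experiment setup can be collected into a small configuration sketch. This is an illustration only, not the authors' code: it reads the two learning rates as 1e-7 and 1e-4, and the names `DatasetSetup`, `SETUPS`, `learning_rate_for`, `warmup_factor`, and the `"clip."` parameter-name prefix are hypothetical. The warm-up length and the schedule after warm-up are not stated in the setup text, so the helper below takes the warm-up length as an argument and simply holds the rate constant afterwards.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetSetup:
    max_words: int   # caption length in words
    num_frames: int  # sampled video frames per clip
    epochs: int      # number of training epochs


# Values quoted in the Experiment Setup entry.
SETUPS = {
    "MSR-VTT": DatasetSetup(max_words=32, num_frames=12, epochs=5),
    "MSVD": DatasetSetup(max_words=32, num_frames=12, epochs=5),
    "DiDeMo": DatasetSetup(max_words=64, num_frames=64, epochs=5),
    "ActivityNet Captions": DatasetSetup(max_words=64, num_frames=64, epochs=10),
}

BATCH_SIZE = 128
LR_CLIP = 1e-7   # CLIP text and visual encoders
LR_OTHER = 1e-4  # all other modules

# Reported for MSR-VTT only; the roles of these symbols are defined in the paper.
MSRVTT_HPARAMS = {"beta": 0.07, "tau": 0.01, "alpha": 2, "lambda": 0.5}


def learning_rate_for(param_name: str, clip_prefix: str = "clip.") -> float:
    """Pick the base learning rate for a parameter from its name."""
    # Assumption: CLIP encoder parameters share a common name prefix.
    return LR_CLIP if param_name.startswith(clip_prefix) else LR_OTHER


def warmup_factor(step: int, warmup_steps: int) -> float:
    """Linear warm-up multiplier: ramps up to 1.0, then stays at 1.0."""
    # Assumption: the post-warm-up schedule is unspecified, so it is held constant.
    if warmup_steps <= 0:
        return 1.0
    return min(1.0, (step + 1) / warmup_steps)
```

At a given step, the effective rate for a parameter would be `learning_rate_for(name) * warmup_factor(step, warmup_steps)`. Keeping the two rates three orders of magnitude apart matches the stated setup, where the pretrained CLIP encoders are updated far more gently than the newly added modules.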