Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Mitigating Semantic Collapse in Partially Relevant Video Retrieval
Authors: WonJun Moon, MinSeok Jung, Gilhan Park, Tae-Young Kim, Cheol-Ho Cho, Woojin Jun, Jae-Pil Heo
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on PRVR benchmarks demonstrate that our framework effectively prevents semantic collapse and substantially improves retrieval accuracy. |
| Researcher Affiliation | Academia | Sungkyunkwan University |
| Pseudocode | Yes | Algorithm 1 Order-Preserving Token Merging (OP-ToMe) ... Algorithm 2 Pre-computing the different levels of clip number (Eq. 9) ... Algorithm 3 Constructing merged clips for Adaptive CBVA |
| Open Source Code | Yes | Code will be released (https://github.com/admins97/MSC_PRVR). |
| Open Datasets | Yes | We evaluated our method on four PRVR datasets: QVHighlights [24], TVR [25], ActivityNet Captions [22], and Charades-STA [12]. |
| Dataset Splits | Yes | TVR [25]... The training set contains 17,435 videos and 87,175 queries, while the evaluation set includes 2,179 videos and 10,895 queries. ActivityNet Captions [22]... The dataset includes 10,009 videos for training and 4,917 for evaluation. Charades-STA [12]... It consists of 13,898 video-sentence pairs for training and 4,233 for evaluation. |
| Hardware Specification | Yes | All experiments are conducted on a single RTX A6000 GPU and an Intel Xeon Gold 6338 CPU (2.00GHz) for all datasets. |
| Software Dependencies | No | For feature extraction, we follow recent works [5, 33, 32]; we extract video features with CLIP-B/32 [37] and Slowfast [10], and use CLIP-B for text embeddings for QVHighlights, and use CLIP-L [37] for encoding both modalities in other datasets. Hyperparameter configurations are adopted from GMMFormer-v2 [46] (e.g., learning rate, batch size, epochs, and optimizer settings) except for the fusing ratio between the frame and clip branches. |
| Experiment Setup | Yes | Hyperparameter configurations are adopted from GMMFormer-v2 [46] (e.g., learning rate, batch size, epochs, and optimizer settings) except for the fusing ratio between the frame and clip branches. We assign a frame score weight of 0.6 and a clip score weight of 0.4. All loss coefficients are fixed across datasets: λE = 15, λA = 30, and λCBVA = 0.1. To construct consistent clips with OP-ToMe, we set N to 75%... Finally, we set the minimum clip count per video to Cmin = 5, and set a similarity threshold τ to 0.7 for QVHighlights, 0.8 for TVR and ActivityNet-Captions, and 0.85 for Charades. |
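
The values quoted in the Experiment Setup row can be collected into a small configuration, shown here as a minimal Python sketch. All identifiers (`FRAME_WEIGHT`, `LOSS_COEFFS`, `fuse_scores`, and so on) are hypothetical and do not come from the paper or its repository. The sketch also assumes that the frame and clip scores are combined by a plain weighted sum; the row gives only the two weights, not the fusion formula.

```python
# Illustrative only: names are hypothetical, values are those quoted in the
# Experiment Setup row above.

FRAME_WEIGHT = 0.6  # frame score weight
CLIP_WEIGHT = 0.4   # clip score weight

# Loss coefficients, fixed across datasets.
LOSS_COEFFS = {"lambda_E": 15, "lambda_A": 30, "lambda_CBVA": 0.1}

C_MIN = 5  # minimum clip count per video

# Similarity threshold tau per dataset.
SIM_THRESHOLD = {
    "QVHighlights": 0.7,
    "TVR": 0.8,
    "ActivityNet-Captions": 0.8,
    "Charades": 0.85,
}


def fuse_scores(frame_score: float, clip_score: float) -> float:
    """Combine branch scores with a weighted sum (an assumed fusion rule)."""
    return FRAME_WEIGHT * frame_score + CLIP_WEIGHT * clip_score
```

Under this assumption, a query-video pair with a frame score of 0.5 and a clip score of 1.0 would receive a fused score of 0.6 × 0.5 + 0.4 × 1.0 = 0.7. The remaining hyperparameters (learning rate, batch size, epochs, optimizer) are adopted from GMMFormer-v2 [46] and are not listed in the row, so they are omitted here.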