Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

PreFM: Online Audio-Visual Event Parsing via Predictive Future Modeling

Authors: Xiao Yu, Yan Fang, Yao Zhao, Yunchao Wei

NeurIPS 2025 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Extensive experiments on the UnAV-100 and LLP datasets show PreFM significantly outperforms state-of-the-art methods by a large margin with significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding. |
| Researcher Affiliation | Academia | Xiao Yu (1,2), Yan Fang (1,2), Yao Zhao (1,2), Yunchao Wei (1,2, corresponding author); (1) Institute of Information Science, Beijing Jiaotong University; (2) Visual Intelligence + X International Joint Laboratory |
| Pseudocode | No | The paper includes figures illustrating the architecture and process flow (e.g., Figure 2: The pipeline of PreFM, Figure 3: Temporal-modality cross fusion), but does not contain explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Code is available at https://github.com/XiaoYu-1123/PreFM. |
| Open Datasets | Yes | Extensive experiments on the UnAV-100 [13] and LLP [54] datasets show PreFM significantly outperforms state-of-the-art methods by a large margin with significantly fewer parameters, offering an insightful approach for real-time multimodal video understanding. |
| Dataset Splits | Yes | For online scenarios, we concatenate LLP clips into longer video sequences. Specifically, half of these sequences are formed by randomly concatenating clips to simulate the rapid scene variations often encountered in online streaming content; the other half are formed by concatenating clips from the same event category to represent longer, continuous event occurrences. Following recent works [29, 77, 9, 49, 27, 71], segment-wise pseudo labels from CLIP [46, 22] and CLAP [8] are used for supervision. |
| Hardware Specification | Yes | All experiments are conducted on a single RTX 3090. |
| Software Dependencies | No | The paper mentions specific tools such as the 'thop' library for measuring FLOPs, uses 'AdamW' as the optimizer, and uses 'CLIP' and 'CLAP' for feature extraction. However, it does not provide version numbers for these software components. |
| Experiment Setup | Yes | For both tasks, we set 60 training epochs, with the first 10 epochs dedicated to warm-up. A batch size of 128 is used, and AdamW serves as the optimizer with a weight decay of 1e-4. We set the value Lc of 10 and Lf of 5 as the default setting. CLIP [46] and CLAP [8] are used to extract visual and audio features with a temporal stride set to 1 second, respectively. All experiments are conducted on a single RTX 3090. For the learning rate and the hidden dimension within the attention block, we use 1e-3 and 256 for On-AVEL, 5e-4 and 128 for On-AVVP. |
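
For readers who want to reuse the settings quoted under Experiment Setup, the values can be collected into a small configuration sketch. The class name, field names, and the `CONFIGS` dictionary below are hypothetical and are not taken from the PreFM repository; only the numeric values and the optimizer name come from the quoted text. The learning-rate schedule after warm-up is not stated in the excerpt, so it is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Values quoted in the Experiment Setup row; all names are hypothetical."""

    # Task-specific values (no shared default in the quoted text).
    learning_rate: float
    hidden_dim: int  # hidden dimension within the attention block

    # Values shared by both tasks.
    epochs: int = 60
    warmup_epochs: int = 10  # first 10 of the 60 epochs
    batch_size: int = 128
    optimizer: str = "AdamW"
    weight_decay: float = 1e-4
    l_c: int = 10  # "Lc" in the quoted text
    l_f: int = 5  # "Lf" in the quoted text
    feature_stride_seconds: float = 1.0  # CLIP / CLAP temporal stride


# One entry per task named in the quoted text.
CONFIGS = {
    "On-AVEL": TrainConfig(learning_rate=1e-3, hidden_dim=256),
    "On-AVVP": TrainConfig(learning_rate=5e-4, hidden_dim=128),
}
```

Keeping the shared values as defaults and the two task-specific values as required fields makes the difference between On-AVEL and On-AVVP visible at the point where each configuration is created.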
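
The clip concatenation described under Dataset Splits (half of the sequences mixed at random, half drawn from one event category) can be illustrated with a minimal sketch. This is not the authors' code: the function name, the `(clip_id, category)` input format, the fixed number of clips per sequence, and sampling without replacement within a sequence are all assumptions, and real LLP clips may carry more than one event label.

```python
import random
from collections import defaultdict


def build_sequences(clips, num_sequences, clips_per_sequence, seed=0):
    """Group clips into longer sequences (hypothetical helper).

    clips: list of (clip_id, category) pairs, one category per clip (assumption).
    Returns a list of (kind, [clip_id, ...]) tuples, where the first half of
    the sequences is "random" and the rest is "same_category".
    """
    rng = random.Random(seed)

    by_category = defaultdict(list)
    for clip_id, category in clips:
        by_category[category].append(clip_id)
    all_ids = [clip_id for clip_id, _ in clips]

    # Only categories with enough clips can fill a same-category sequence.
    eligible = sorted(
        c for c, ids in by_category.items() if len(ids) >= clips_per_sequence
    )
    if len(all_ids) < clips_per_sequence or not eligible:
        raise ValueError("not enough clips for the requested sequence length")

    num_random = num_sequences // 2
    sequences = []
    for i in range(num_sequences):
        if i < num_random:
            # Mixed sequence: clips drawn from the whole pool.
            ids = rng.sample(all_ids, clips_per_sequence)
            sequences.append(("random", ids))
        else:
            # Continuous-event sequence: every clip shares one category.
            category = rng.choice(eligible)
            ids = rng.sample(by_category[category], clips_per_sequence)
            sequences.append(("same_category", ids))
    return sequences
```

For example, `build_sequences(clips, num_sequences=4, clips_per_sequence=3)` yields two mixed sequences followed by two single-category sequences, each holding three clip identifiers.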