Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion

Authors: Shoubin Yu, Jaehong Yoon, Mohit Bansal

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including conventional Video QA and Video-Audio/3D/Touch/Thermal QA, and achieve better/equivalent performance against strong multimodal LLMs, including OneLLM, BLIP-2, and SeViLA while reducing over 90% trainable parameters. We provide extensive analyses of CREMA, including the impact of each modality on reasoning domains, the design of the fusion module, and example visualizations.
Researcher Affiliation | Academia | Shoubin Yu, Jaehong Yoon, Mohit Bansal; UNC Chapel Hill; EMAIL
Pseudocode | No | The paper describes methods through textual descriptions and mathematical equations, but it does not contain any clearly labeled pseudocode or algorithm blocks in a structured format.
Open Source Code | Yes | We will make our code and models publicly accessible. Project Page: https://CREMA-VideoLLM.github.io/.
Open Datasets | Yes | We evaluate CREMA on the following video reasoning and QA tasks: SQA3D (Ma et al., 2023), MUSIC-AVQA (Li et al., 2022), and NExT-QA (Xiao et al., 2021). We further evaluate CREMA on Touch QA and Thermal QA collected by ourselves based on public video-touch (Touch&Go (Yang et al., 2022)) and video-thermal data (Thermal-IM (Tang et al., 2023b)). See Appendix (Sections A.1 and A.3) for more details.
Dataset Splits | Yes | Our Touch QA dataset contains 714 training data and 3212 test data. (5) Thermal QA: Similar to the Touch QA dataset, we build the Thermal QA dataset on a public video-thermal heatmap dataset, Thermal-IM (Tang et al., 2023b). Thermal-IM contains action labels in each video and was originally designed to predict the human pose 3 seconds earlier according to both video and thermal heatmap. We reformulate it as a QA task as well by asking the model: What action might have occurred before this video? The answer is the action label. Our Thermal QA dataset contains 1131 training data and 391 test data. (6) Perception Test (Patraucean et al., 2023): a multimodal benchmark designed to comprehensively evaluate the perception and reasoning skills of multimodal video models. We use the multiple-choice QA part of this benchmark, which contains 1955 train data and 5260 validation data.
Hardware Specification | Yes | We conduct experiments with 4 48GB A6000 GPUs. We report baseline model training hyperparameters in Table 10. The experiments are conducted on the same 4 48GB A6000 GPU machine. (Table 15: Tested on a single A6000 GPU.)
Software Dependencies | No | The paper mentions software components like PyTorch, Hugging Face Transformers, and Torchvision in the License Information section, but it does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Table 10: Baseline models fine-tuning hyperparameters. Table 11: CREMA fine-tuning hyperparameters. Both tables specify Batch Size per GPU, Learning Rate, Warmup Epoch, and Gradient Accumulation Step for various datasets and modalities.