Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities

Authors: Hammad Ayyubi, Christopher Thomas, Lovish Chum, Rahul Lokesh, Long Chen, Yulei Niu, Xudong Lin, Xuande Feng, Jaywon Koo, Sounak Ray, Shih-Fu Chang

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task. Further, we address these limitations via a new weakly supervised model, leveraging only unannotated video-article pairs from MultiHiEve. We perform a thorough evaluation of our proposed method, which demonstrates improved performance on this task, and highlight opportunities for future research.
Researcher Affiliation | Collaboration | Hammad Ayyubi (1), Christopher Thomas (2), Lovish Chum (1), Rahul Lokesh (3), Long Chen (4), Yulei Niu (1), Xudong Lin (1), Xuande Feng (1), Jaywon Koo (1), Sounak Ray (1), Shih-Fu Chang (1). Affiliations: 1 Columbia University, 2 Virginia Tech, 3 Samsung Research America, 4 The Hong Kong University of Science and Technology.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper provides a link for "Data: https://github.com/hayyubi/multihieve" and states "We release the MultiHiEve dataset to facilitate research on this task." This link is explicitly for the dataset and does not clearly state that it hosts the source code for the methodology described in the paper.
Open Datasets | Yes | To support research on the proposed task, we introduce MultiHiEve, a dataset containing news articles and the associated video clips. ... Data: https://github.com/hayyubi/multihieve
Dataset Splits | Yes | We split the data two ways: 1) a 100K unannotated train split for self-supervised/weakly supervised training, and 2) a 526-example annotated split (249 validation set and 277 test set) for benchmarking and evaluation.
Hardware Specification | Yes | We train our model for 15 epochs using a batch size of 1024 and a learning rate of 1e-5 on 4 NVIDIA Tesla V100 GPUs for a total training time of around 34 hours.
Software Dependencies | Yes | We use the same automatic methods to detect them as used on the test data: Open Domain IE (Shen et al. 2021) and the open-source library PySceneDetect for text event and video event detection, respectively. ... As the CLIP (Radford et al. 2021) model has demonstrated state-of-the-art performance in multimodal retrieval tasks... we use it for this step. A hedged sketch of this detection-and-retrieval step appears after the table.
Experiment Setup | Yes | Notably, most text event and video event pairs are unrelated (94.52% in the train set). To mitigate label bias, we adjust the labels in the cross-entropy loss using the inverse ratio of their count in the train set, following Wang et al. (2021). Our best model uses a single layer of multi-headed attention in CT. We train our model for 15 epochs using a batch size of 1024 and a learning rate of 1e-5 on 4 NVIDIA Tesla V100 GPUs for a total training time of around 34 hours. A sketch of the inverse-frequency loss reweighting appears after the table.
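
The software dependencies row names PySceneDetect for video event (shot) detection and CLIP for matching text events to video events. The paper does not ship code for this preprocessing, so the following is a minimal sketch of how those two off-the-shelf tools are commonly wired together; the ViT-B/32 checkpoint, the middle-frame sampling, and the match_events helper are illustrative assumptions, not details taken from the paper (text events would come from the Open Domain IE system and are represented here as plain strings).

```python
# Hedged sketch: detect video events (shots) with PySceneDetect, then score each
# text event against each shot with CLIP embeddings. Checkpoint choice, frame
# sampling, and helper names are assumptions for illustration only.
import torch
import clip                                        # pip install git+https://github.com/openai/CLIP.git
import cv2
from PIL import Image
from scenedetect import detect, ContentDetector    # pip install scenedetect[opencv]

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def video_events(video_path):
    """Return (start_frame, end_frame) pairs for each detected shot."""
    return [(s.get_frames(), e.get_frames()) for s, e in detect(video_path, ContentDetector())]

def middle_frame(video_path, start, end):
    """Grab the middle frame of a shot as a PIL image."""
    cap = cv2.VideoCapture(video_path)
    cap.set(cv2.CAP_PROP_POS_FRAMES, (start + end) // 2)
    _, frame = cap.read()
    cap.release()
    return Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

@torch.no_grad()
def match_events(video_path, text_events):
    """Cosine similarity between every text event and every detected shot."""
    shots = video_events(video_path)
    images = torch.stack([preprocess(middle_frame(video_path, s, e)) for s, e in shots]).to(device)
    tokens = clip.tokenize(text_events).to(device)
    img_feats = model.encode_image(images)
    txt_feats = model.encode_text(tokens)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    return txt_feats @ img_feats.T                 # shape: [num_text_events, num_shots]
```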
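
The experiment setup row reports that 94.52% of text/video event pairs in the train set are unrelated and that the cross-entropy loss is reweighted by the inverse ratio of each label's count. The paper does not include this code, so the snippet below is a minimal PyTorch sketch of such inverse-frequency reweighting; the three-way label space and all counts other than the 94.52% unrelated share are placeholders, not values from the paper.

```python
# Hedged sketch: inverse-frequency class weights for the cross-entropy loss.
# Only the 94.52% "unrelated" share comes from the paper; the rest is illustrative.
import torch
import torch.nn as nn

label_counts = torch.tensor([94_520.0, 3_000.0, 2_480.0])  # unrelated + two placeholder relation labels

# Weight each class by the inverse of its frequency, normalized to mean 1.
weights = label_counts.sum() / label_counts
weights = weights / weights.mean()

criterion = nn.CrossEntropyLoss(weight=weights)

# Usage with the reported batch size of 1024; logits and labels are dummy tensors here.
logits = torch.randn(1024, 3)
labels = torch.randint(0, 3, (1024,))
loss = criterion(logits, labels)
```

The reported schedule (15 epochs, batch size 1024, learning rate 1e-5 on 4 V100 GPUs) would wrap this loss in a standard optimizer loop.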