Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities
Authors: Hammad Ayyubi, Christopher Thomas, Lovish Chum, Rahul Lokesh, Long Chen, Yulei Niu, Xudong Lin, Xuande Feng, Jaywon Koo, Sounak Ray, Shih-Fu Chang
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task. Further, we address these limitations via a new weakly supervised model, leveraging only unannotated video-article pairs from MultiHiEve. We perform a thorough evaluation of our proposed method, which demonstrates improved performance on this task, and highlight opportunities for future research. |
| Researcher Affiliation | Collaboration | Hammad Ayyubi¹, Christopher Thomas², Lovish Chum¹, Rahul Lokesh³, Long Chen⁴, Yulei Niu¹, Xudong Lin¹, Xuande Feng¹, Jaywon Koo¹, Sounak Ray¹, Shih-Fu Chang¹ (¹Columbia University, ²Virginia Tech, ³Samsung Research America, ⁴The Hong Kong University of Science and Technology) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a link for "Data: https://github.com/hayyubi/multihieve" and states "We release MultiHiEve dataset to facilitate research on this task." This link is explicitly for the dataset and does not clearly state that it hosts the source code for the methodology described in the paper. |
| Open Datasets | Yes | To support research on the proposed task, we introduce MultiHiEve, a dataset containing news articles and the associated video clips. ... Data: https://github.com/hayyubi/multihieve |
| Dataset Splits | Yes | We split the data two ways: 1) a 100K unannotated train split for self-supervised/weakly supervised training, and 2) a 526-example annotated split (249 validation set and 277 test set) for benchmarking and evaluation. |
| Hardware Specification | Yes | We train our model for 15 epochs using a batch size of 1024 and a learning rate of 1e-5 on 4 NVIDIA Tesla V100 GPUs, for a total training time of around 34 hours. |
| Software Dependencies | Yes | We use the same automatic methods to detect them as used on the test data: Open Domain IE (Shen et al. 2021) and the open-source library PySceneDetect for text event and video event detection, respectively. ... As the CLIP (Radford et al. 2021) model has demonstrated state-of-the-art performance in multimodal retrieval tasks... we use it for this step. (A minimal usage sketch for these libraries appears after the table.) |
| Experiment Setup | Yes | Notably, most text event and video event pairs are unrelated (94.52% in the train set). To mitigate label bias, we adjust the labels in the cross-entropy loss using the inverse ratio of their count in the train set, following Wang et al. (2021). Our best model uses a single layer of multi-headed attention in CT. We train our model for 15 epochs using a batch size of 1024 and a learning rate of 1e-5 on 4 NVIDIA Tesla V100 GPUs, for a total training time of around 34 hours. (An illustrative inverse-frequency weighting sketch appears after the table.) |
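
The Software Dependencies row names two off-the-shelf tools: PySceneDetect for video event (shot boundary) detection and CLIP for matching text to video content. The sketch below is not the authors' pipeline; it only illustrates, under stated assumptions, how these libraries can be chained: detect shots, take each shot's middle frame, and score it against candidate text events with a CLIP checkpoint. The video path, the example text events, and the `openai/clip-vit-base-patch32` checkpoint choice are placeholders.

```python
# Minimal sketch (assumptions, not the authors' released code):
# 1) detect video events (shots) with PySceneDetect,
# 2) score each shot's middle frame against candidate text events with CLIP.
import cv2
import torch
from PIL import Image
from scenedetect import detect, ContentDetector
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def middle_frames(video_path: str) -> list:
    """Return one PIL image per detected shot (the shot's middle frame)."""
    shots = detect(video_path, ContentDetector())  # list of (start, end) FrameTimecodes
    cap = cv2.VideoCapture(video_path)
    frames = []
    for start, end in shots:
        mid = (start.get_frames() + end.get_frames()) // 2
        cap.set(cv2.CAP_PROP_POS_FRAMES, mid)
        ok, bgr = cap.read()
        if ok:
            frames.append(Image.fromarray(cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB)))
    cap.release()
    return frames

# Placeholder inputs: any news video file and text events extracted from its article.
text_events = ["protesters march downtown", "police detain a demonstrator"]
frames = middle_frames("news_clip.mp4")

inputs = processor(text=text_events, images=frames, return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_image  # shape: (num_shots, num_text_events)
print(sims.softmax(dim=-1))                  # per-shot distribution over text events
```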
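
The Experiment Setup row reports that the cross-entropy loss is adjusted by the inverse ratio of label counts to offset the 94.52% share of unrelated text event/video event pairs. One common way to apply such inverse-frequency reweighting in PyTorch is class-weighted cross-entropy, sketched below; the label set and counts are illustrative, and the paper's actual adjustment (following Wang et al. 2021) may be implemented differently, for example on the logits rather than as loss weights.

```python
# Minimal sketch (assumption): inverse-frequency class weights in cross-entropy.
# Only the ~94.5% "unrelated" share comes from the paper; the remaining label
# names and counts are placeholders for illustration.
import torch
import torch.nn as nn

class_counts = torch.tensor([94_520.0, 3_000.0, 2_480.0])  # unrelated, parent-child, child-parent (illustrative)
weights = class_counts.sum() / class_counts                 # inverse-frequency weights
weights = weights / weights.sum()                           # normalize for readability

criterion = nn.CrossEntropyLoss(weight=weights)

logits = torch.randn(8, 3)            # dummy batch: one logit vector per text/video event pair
labels = torch.randint(0, 3, (8,))    # dummy relation labels
loss = criterion(logits, labels)      # rare relations now contribute more per example
print(loss)
```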