Grounded Video Situation Recognition

Authors: Zeeshan Khan, C.V. Jawahar, Makarand Tapaswi

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localize verb-roles without grounding annotations at training time." "We evaluate our model in two main settings." (Section 4, Experiments)
Researcher Affiliation | Academia | "Zeeshan Khan, C. V. Jawahar, Makarand Tapaswi, CVIT, IIIT Hyderabad"
Pseudocode | No | The paper describes the model architecture and steps in natural language and with diagrams, but does not include formal pseudocode or algorithm blocks.
Open Source Code | No | "More examples on our project page, https://zeeshank95.github.io/grvidsitu/GVSR.html."
Open Datasets | Yes | "We evaluate our model on the VidSitu [27] dataset that consists of 29k videos (23.6k train, 1.3k val, and others in task-specific test sets) collected from a diverse set of 3k movies."
Dataset Splits | Yes | "We evaluate our model on the VidSitu [27] dataset that consists of 29k videos (23.6k train, 1.3k val, and others in task-specific test sets) collected from a diverse set of 3k movies."
Hardware Specification | Yes | "As we use pretrained features, we train our model on a single RTX-2080 GPU, batch size of 16."
Software Dependencies | No | "We implement our model in Pytorch [22]. We use the Adam optimizer [12] with a learning rate of 10^-4 to train the whole model end-to-end."
Experiment Setup | Yes | "All the three Transformers have the same configuration: they have 3 layers with 8 attention heads, and hidden dimension 1024. We use the Adam optimizer [12] with a learning rate of 10^-4 to train the whole model end-to-end. As we use pretrained features, we train our model on a single RTX-2080 GPU, batch size of 16." (See the sketch after this table.)
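To make the quoted experiment setup concrete, below is a minimal PyTorch sketch of the reported configuration: three identically configured Transformers (3 layers, 8 attention heads, hidden dimension 1024) trained end-to-end with Adam at a learning rate of 10^-4 and batch size 16. This is an illustrative assumption of how the quoted hyperparameters fit together, not the authors' released code; the module names, feature shapes, and placeholder loss are hypothetical.

```python
# Minimal sketch of the quoted training setup (not the authors' implementation).
import torch
import torch.nn as nn

HIDDEN_DIM = 1024     # hidden dimension reported in the paper
NUM_LAYERS = 3        # layers per Transformer
NUM_HEADS = 8         # attention heads per layer
BATCH_SIZE = 16       # batch size reported in the paper
LEARNING_RATE = 1e-4  # Adam learning rate reported in the paper


def make_transformer() -> nn.TransformerEncoder:
    """One of the three identically configured Transformers."""
    layer = nn.TransformerEncoderLayer(
        d_model=HIDDEN_DIM, nhead=NUM_HEADS, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=NUM_LAYERS)


# Hypothetical names for the three Transformers; see the paper for how the
# actual blocks are wired and what each one attends over.
transformer_a = make_transformer()
transformer_b = make_transformer()
transformer_c = make_transformer()

params = (
    list(transformer_a.parameters())
    + list(transformer_b.parameters())
    + list(transformer_c.parameters())
)
optimizer = torch.optim.Adam(params, lr=LEARNING_RATE)

# Dummy end-to-end training step on pretrained clip features
# (assumed shape: 16 videos x 5 events x 1024-dim features).
features = torch.randn(BATCH_SIZE, 5, HIDDEN_DIM)
hidden = transformer_a(features)
hidden = transformer_b(hidden)
output = transformer_c(hidden)
loss = output.pow(2).mean()  # placeholder loss, not the paper's objective

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the model consumes pretrained video features rather than raw frames, a configuration of this size plausibly trains on a single RTX-2080 with batch size 16, consistent with the hardware row above.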