Grounded Video Situation Recognition

Authors: Zeeshan Khan, C.V. Jawahar, Makarand Tapaswi

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localize verb-roles without grounding annotations at training time." "We evaluate our model in two main settings." (Section 4, Experiments)
Researcher Affiliation | Academia | "Zeeshan Khan, C. V. Jawahar, Makarand Tapaswi, CVIT, IIIT Hyderabad"
Pseudocode | No | The paper describes the model architecture and steps in natural language and with diagrams, but does not include formal pseudocode or algorithm blocks.
Open Source Code | No | "More examples on our project page, https://zeeshank95.github.io/grvidsitu/GVSR.html."
Open Datasets | Yes | "We evaluate our model on the VidSitu [27] dataset that consists of 29k videos (23.6k train, 1.3k val, and others in task-specific test sets) collected from a diverse set of 3k movies."
Dataset Splits | Yes | "We evaluate our model on the VidSitu [27] dataset that consists of 29k videos (23.6k train, 1.3k val, and others in task-specific test sets) collected from a diverse set of 3k movies."
Hardware Specification | Yes | "As we use pretrained features, we train our model on a single RTX-2080 GPU, batch size of 16."
Software Dependencies | No | "We implement our model in Pytorch [22]. We use the Adam optimizer [12] with a learning rate of 10^-4 to train the whole model end-to-end."
Experiment Setup | Yes | "All the three Transformers have the same configuration: they have 3 layers with 8 attention heads, and hidden dimension 1024. We use the Adam optimizer [12] with a learning rate of 10^-4 to train the whole model end-to-end. As we use pretrained features, we train our model on a single RTX-2080 GPU, batch size of 16." (See the sketch after this table.)
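To make the quoted experiment setup concrete, below is a minimal PyTorch sketch of the reported configuration: three identically configured Transformers (3 layers, 8 attention heads, hidden dimension 1024) trained end-to-end with Adam at a learning rate of 10^-4 and batch size 16. This is an illustrative assumption of how the quoted hyperparameters fit together, not the authors' released code; the module names, feature shapes, and placeholder loss are hypothetical.

```python
# Minimal sketch of the quoted training setup (not the authors' implementation).
import torch
import torch.nn as nn

HIDDEN_DIM = 1024     # hidden dimension reported in the paper
NUM_LAYERS = 3        # layers per Transformer
NUM_HEADS = 8         # attention heads per layer
BATCH_SIZE = 16       # batch size reported in the paper
LEARNING_RATE = 1e-4  # Adam learning rate reported in the paper


def make_transformer() -> nn.TransformerEncoder:
    """One of the three identically configured Transformers."""
    layer = nn.TransformerEncoderLayer(
        d_model=HIDDEN_DIM, nhead=NUM_HEADS, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=NUM_LAYERS)


# Hypothetical names for the three Transformers; see the paper for how the
# actual blocks are wired and what each one attends over.
transformer_a = make_transformer()
transformer_b = make_transformer()
transformer_c = make_transformer()

params = (
    list(transformer_a.parameters())
    + list(transformer_b.parameters())
    + list(transformer_c.parameters())
)
optimizer = torch.optim.Adam(params, lr=LEARNING_RATE)

# Dummy end-to-end training step on pretrained clip features
# (assumed shape: 16 videos x 5 events x 1024-dim features).
features = torch.randn(BATCH_SIZE, 5, HIDDEN_DIM)
hidden = transformer_a(features)
hidden = transformer_b(hidden)
output = transformer_c(hidden)
loss = output.pow(2).mean()  # placeholder loss, not the paper's objective

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because the model consumes pretrained video features rather than raw frames, a configuration of this size plausibly trains on a single RTX-2080 with batch size 16, consistent with the hardware row above.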