Grounded Video Situation Recognition
Authors: Zeeshan Khan, C.V. Jawahar, Makarand Tapaswi
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When evaluated on a grounding-augmented version of the VidSitu dataset, we observe a large improvement in entity captioning accuracy, as well as the ability to localize verb-roles without grounding annotations at training time. (Sec. 4, Experiments) We evaluate our model in two main settings. |
| Researcher Affiliation | Academia | Zeeshan Khan, C. V. Jawahar, Makarand Tapaswi (CVIT, IIIT Hyderabad) |
| Pseudocode | No | The paper describes the model architecture and steps in natural language and with diagrams, but does not include formal pseudocode or algorithm blocks. |
| Open Source Code | No | More examples on our project page, https://zeeshank95.github.io/grvidsitu/GVSR.html. |
| Open Datasets | Yes | We evaluate our model on the VidSitu [27] dataset that consists of 29k videos (23.6k train, 1.3k val, and others in task-specific test sets) collected from a diverse set of 3k movies. |
| Dataset Splits | Yes | We evaluate our model on the VidSitu [27] dataset that consists of 29k videos (23.6k train, 1.3k val, and others in task-specific test sets) collected from a diverse set of 3k movies. |
| Hardware Specification | Yes | As we use pretrained features, we train our model on a single RTX-2080 GPU with a batch size of 16. |
| Software Dependencies | No | We implement our model in PyTorch [22]. We use the Adam optimizer [12] with a learning rate of 10^-4 to train the whole model end-to-end. No library or framework versions are specified. |
| Experiment Setup | Yes | All three Transformers have the same configuration: 3 layers with 8 attention heads and a hidden dimension of 1024. We use the Adam optimizer [12] with a learning rate of 10^-4 to train the whole model end-to-end. As we use pretrained features, we train our model on a single RTX-2080 GPU with a batch size of 16. (A hedged configuration sketch follows the table.) |