Paxion: Patching Action Knowledge in Video-Language Foundation Models
Authors: Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, Heng Ji
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Despite recent video-language models' (VidLM) impressive performance on various benchmark tasks, our diagnostic tasks reveal their surprising deficiency (near-random performance) in action knowledge, suggesting that current models rely on object recognition abilities as a shortcut for action understanding. To remedy this, we propose a novel framework, PAXION, along with a new Discriminative Video Dynamics Modeling (DVDM) objective. Our extensive analyses show that PAXION and DVDM together effectively fill the gap in action knowledge understanding (~50% → 80%), while maintaining or improving performance on a wide spectrum of both object- and action-centric downstream tasks. |
| Researcher Affiliation | Academia | Zhenhailong Wang¹, Ansel Blume¹, Sha Li¹, Genglin Liu¹, Jaemin Cho², Zineng Tang², Mohit Bansal², Heng Ji¹; ¹UIUC, ²UNC; {wangz3,hengji}@illinois.edu |
| Pseudocode | No | The paper describes the architecture and methods but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and data will be made publicly available for research purposes at https://github.com/MikeWangWZHL/Paxion.git. |
| Open Datasets | Yes | We construct this benchmark by leveraging two existing open-domain video-language datasets, Ego4D [14] and Something-Something v2 (SSv2) [13], which provide fine-grained action annotations for each video clip. |
| Dataset Splits | Yes | The final statistics of the training and evaluation splits can be found in Table 5. For SSv2, since the test set does not provide label annotation, i.e., annotation with filled object names, we report scores on the validation set. For Ego4D, we evaluate on the test set. |
| Hardware Specification | Yes | We use two Nvidia Tesla V100 (16GB) GPUs for all experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | We use AdamW [34] optimizer with a learning rate of 1e-5 and a weight decay of 0.05. For the transformer variant, we use a batch size of 8 per GPU. For the Perceiver variant, we are able to increase the batch size to 32 per GPU due to the reduced computation complexity. |
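
For concreteness, the snippet below is a minimal sketch of the reported optimization setup (AdamW, learning rate 1e-5, weight decay 0.05, batch size 8 per GPU for the transformer variant or 32 for the Perceiver variant). The `DummyPatcher` module, its dimensions, the synthetic tensors, and the placeholder loss are hypothetical stand-ins, not names or objectives taken from the authors' code; the real Knowledge Patcher and the VAC/DVDM losses are defined in the released repository.

```python
# Hedged sketch of the reported training configuration, not the authors' implementation.
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

class DummyPatcher(nn.Module):
    """Hypothetical stand-in for the trainable patcher module."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_feats, text_feats):
        # Placeholder MSE loss; the paper trains with VAC/DVDM objectives instead.
        return ((self.proj(video_feats) - text_feats) ** 2).mean()

model = DummyPatcher()
# Optimizer settings as reported in the paper.
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.05)

BATCH_SIZE_PER_GPU = 8  # transformer variant; 32 for the Perceiver variant
dataset = TensorDataset(torch.randn(64, 256), torch.randn(64, 256))  # synthetic stand-in data
loader = DataLoader(dataset, batch_size=BATCH_SIZE_PER_GPU, shuffle=True)

for video_feats, text_feats in loader:
    loss = model(video_feats, text_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```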