Paxion: Patching Action Knowledge in Video-Language Foundation Models
Authors: Zhenhailong Wang, Ansel Blume, Sha Li, Genglin Liu, Jaemin Cho, Zineng Tang, Mohit Bansal, Heng Ji
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Despite recent video-language models' (VidLM) impressive performance on various benchmark tasks, our diagnostic tasks reveal their surprising deficiency (near-random performance) in action knowledge, suggesting that current models rely on object recognition abilities as a shortcut for action understanding. To remedy this, we propose a novel framework, PAXION, along with a new Discriminative Video Dynamics Modeling (DVDM) objective. Our extensive analyses show that PAXION and DVDM together effectively fill the gap in action knowledge understanding (~50% → 80%), while maintaining or improving performance on a wide spectrum of both object- and action-centric downstream tasks. |
| Researcher Affiliation | Academia | Zhenhailong Wang¹, Ansel Blume¹, Sha Li¹, Genglin Liu¹, Jaemin Cho², Zineng Tang², Mohit Bansal², Heng Ji¹; ¹UIUC, ²UNC; {wangz3,hengji}@illinois.edu |
| Pseudocode | No | The paper describes the architecture and methods but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and data will be made publicly available for research purposes at https://github.com/MikeWangWZHL/Paxion.git. |
| Open Datasets | Yes | We construct this benchmark by leveraging two existing open-domain video-language datasets, Ego4D [14] and Something-Something v2 (SSv2) [13], which provide fine-grained action annotations for each video clip. |
| Dataset Splits | Yes | The final statistics of the training and evaluation splits can be found in Table 5. For SSv2, since the test set does not provide label annotation, i.e., annotation with filled object names, we report scores on the validation set. For Ego4D, we evaluate on the test set. |
| Hardware Specification | Yes | We use two Nvidia Tesla V100 (16GB) GPUs for all experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies or libraries used in the experiments. |
| Experiment Setup | Yes | We use AdamW [34] optimizer with a learning rate of 1e-5 and a weight decay of 0.05. For the transformer variant, we use a batch size of 8 per GPU. For the Perceiver variant, we are able to increase the batch size to 32 per GPU due to the reduced computation complexity. |
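
For concreteness, the snippet below is a minimal sketch of the reported optimization setup (AdamW, learning rate 1e-5, weight decay 0.05, batch size 8 per GPU for the transformer variant or 32 for the Perceiver variant). The `DummyPatcher` module, its dimensions, the synthetic tensors, and the placeholder loss are hypothetical stand-ins, not names or objectives taken from the authors' code; the real Knowledge Patcher and the VAC/DVDM losses are defined in the released repository.

```python
# Hedged sketch of the reported training configuration, not the authors' implementation.
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

class DummyPatcher(nn.Module):
    """Hypothetical stand-in for the trainable patcher module."""
    def __init__(self, dim=256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, video_feats, text_feats):
        # Placeholder MSE loss; the paper trains with VAC/DVDM objectives instead.
        return ((self.proj(video_feats) - text_feats) ** 2).mean()

model = DummyPatcher()
# Optimizer settings as reported in the paper.
optimizer = AdamW(model.parameters(), lr=1e-5, weight_decay=0.05)

BATCH_SIZE_PER_GPU = 8  # transformer variant; 32 for the Perceiver variant
dataset = TensorDataset(torch.randn(64, 256), torch.randn(64, 256))  # synthetic stand-in data
loader = DataLoader(dataset, batch_size=BATCH_SIZE_PER_GPU, shuffle=True)

for video_feats, text_feats in loader:
    loss = model(video_feats, text_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```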