A Multimodal, Multi-Task Adapting Framework for Video Action Recognition
Authors: Mengmeng Wang, Jiazheng Xing, Boyuan Jiang, Jun Chen, Jianbiao Mei, Xingxing Zuo, Guang Dai, Jingdong Wang, Yong Liu
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios. |
| Researcher Affiliation | Collaboration | Zhejiang University; Youtu Lab, Tencent; Technical University of Munich; SGIT AI Lab, State Grid Corporation of China; Baidu Inc. |
| Pseudocode | No | The paper describes the architecture and components using figures (e.g., Fig. 3a, 3b) and mathematical formulations, but it does not contain a dedicated pseudocode or algorithm block. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code or a link to a code repository. |
| Open Datasets | Yes | We evaluate our M2-CLIP for supervised learning on two primary datasets: Kinetics-400 (K400) (Kay et al. 2017) and Something-Something-V2 (SSv2) (Goyal et al. 2017). For the generalization evaluation, we test our model on UCF101 (Soomro, Zamir, and Shah 2012) and HMDB51 (Kuehne et al. 2011). |
| Dataset Splits | No | The paper mentions using specific datasets (Kinetics-400, Something-Something-V2, UCF101, HMDB51) and a frame sampling strategy, but it does not explicitly provide details about train/validation/test dataset splits (e.g., percentages, sample counts, or a method for splitting). |
| Hardware Specification | No | The paper does not specify the hardware used to run the experiments, such as specific GPU or CPU models. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software dependencies used in the experiments. |
| Experiment Setup | Yes | We employ ViT-B/16 based CLIP as our backbone and use a sparse frame sampling strategy with 8, 16, or 32 frames during training and inference. (A sketch of segment-based sparse sampling follows the table.) |
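
The sparse frame sampling strategy cited in the Experiment Setup row is not spelled out in the paper excerpt. Below is a minimal sketch of TSN-style segment sampling, which is the strategy commonly paired with CLIP-based video backbones; the function name `sparse_sample` and the center-frame inference variant are illustrative assumptions, not the authors' released code.

```python
import random

def sparse_sample(num_total_frames, num_segments=8, training=True):
    """Sparse (segment-based) frame sampling.

    Splits the video into `num_segments` equal chunks and picks one
    frame index per chunk: a random offset within the chunk during
    training, the chunk center at inference. Returns frame indices.
    """
    seg_len = num_total_frames / num_segments
    indices = []
    for k in range(num_segments):
        start = int(k * seg_len)
        end = max(start, int((k + 1) * seg_len) - 1)
        if training:
            indices.append(random.randint(start, end))  # random offset
        else:
            indices.append((start + end) // 2)  # deterministic center
    return indices

# Example: sample 8 of 120 frames, matching the 8-frame setting above.
print(sparse_sample(120, num_segments=8, training=True))
print(sparse_sample(120, num_segments=8, training=False))
```

The same function covers the 16- and 32-frame settings by changing `num_segments`; the sampled indices would then be used to gather frames before they are fed to the ViT-B/16 CLIP encoder.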