DTS-TPT: Dual Temporal-Sync Test-time Prompt Tuning for Zero-shot Activity Recognition
Authors: Rui Yan, Hongyu Qu, Xiangbo Shu, Wenbin Li, Jinhui Tang, Tieniu Tan
IJCAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We extensively evaluate the proposed framework on various benchmarks and quantitative and qualitative results show its substantial improvements and strong interpretability. ... We compare the proposed DTS-TPT with existing zero-shot video activity recognition methods on four benchmarks with 8 frames, as shown in Table 1. ... Our approach achieves state-of-the-art results on HMDB51, UCF101, Kinetics-600, and ActivityNet in terms of top-1 accuracy, surpassing BIKE by a significant margin. ... To illustrate the effectiveness of each component of our approach, we conduct ablation experiments on HMDB51 and UCF101 datasets. |
| Researcher Affiliation | Academia | Rui Yan¹, Hongyu Qu², Xiangbo Shu², Wenbin Li¹, Jinhui Tang², and Tieniu Tan¹ (¹Nanjing University, ²Nanjing University of Science and Technology); {ruiyan, liwenbin, tnt}@nju.edu.cn, {quhongyu, shuxb, jinhuitang}@njust.edu.cn |
| Pseudocode | Yes | Algorithm 1 Pseudocode of DTS-TPT |
| Open Source Code | Yes | The code is available at https://github.com/quhongyu/DTS-TPT. |
| Open Datasets | Yes | HMDB-51 [Kuehne et al., 2011] ... UCF-101 [Soomro et al., 2012] ... Kinetics-600 [Carreira et al., 2018] ... ActivityNet [Caba Heilbron et al., 2015] |
| Dataset Splits | Yes | Following [Ni et al., 2022; Rasheed et al., 2023; Lin et al., 2023], we report the mean and standard deviation of results on three official validation sets. |
| Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, or memory amounts) were mentioned for running the experiments. |
| Software Dependencies | No | The paper mentions software components such as 'CLIP with ViT-B/16' and the 'AdamW optimizer', as well as LLMs (e.g., 'GPT-3.5'), but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | We sample T = 8 frames from the test video and augment them K − 1 (K = 32) times using AugMix [Hendrycks et al., 2019]. We create 3 learnable tokens as text prompts and initialize them as 'an action of'. ... For each inference, we compute predictions based on a batch of 32 augmented views (including the original one) and then select the top 20% confident predictions for further optimization. We adopt the AdamW optimizer with a learning rate of 0.001. |
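
The Experiment Setup row describes a TPT-style test-time adaptation loop. Below is a minimal PyTorch sketch of such a loop under the stated settings (K = 32 AugMix views of 8 frames, top-20% confidence selection, 3 learnable prompt tokens, AdamW with lr 0.001). The `model` callable, the function name `test_time_prompt_tuning`, and the entropy-minimization objective are assumptions for illustration only; the authors' actual implementation is in the linked repository.

```python
import torch

# Assumed interface: model(views, prompt_tokens) -> class logits [num_views, num_classes]
# for a CLIP-like video-text model whose text prompts contain `prompt_tokens`
# as learnable context vectors (3 tokens, initialized from "an action of").
def test_time_prompt_tuning(model, views, prompt_tokens,
                            select_ratio=0.2, lr=1e-3, steps=1):
    """One test-time adaptation episode on a single test video.

    views: tensor [K, T, C, H, W] -- the original clip plus K-1 AugMix-augmented
           copies (K = 32, T = 8 in the paper's setting).
    """
    # Only the text-prompt tokens are optimized at test time.
    prompt_tokens = prompt_tokens.detach().clone().requires_grad_(True)
    optimizer = torch.optim.AdamW([prompt_tokens], lr=lr)

    for _ in range(steps):
        logits = model(views, prompt_tokens)                # [K, num_classes]
        probs = logits.softmax(dim=-1)

        # Keep the most confident views (lowest prediction entropy).
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        k = max(1, int(select_ratio * probs.size(0)))
        keep = entropy.topk(k, largest=False).indices

        # Assumed objective: minimize entropy of the prediction averaged
        # over the selected confident views (as in standard TPT).
        avg_probs = probs[keep].mean(dim=0)
        loss = -(avg_probs * avg_probs.clamp_min(1e-12).log()).sum()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Final prediction with the adapted prompts on the original view.
    with torch.no_grad():
        final_logits = model(views[:1], prompt_tokens)
    return final_logits.softmax(dim=-1).argmax(dim=-1)
```

The sketch only illustrates the reported hyperparameters (K, selection ratio, optimizer, learning rate); the paper's dual temporal-sync design and its exact loss are not reproduced here.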