DTS-TPT: Dual Temporal-Sync Test-time Prompt Tuning for Zero-shot Activity Recognition

Authors: Rui Yan, Hongyu Qu, Xiangbo Shu, Wenbin Li, Jinhui Tang, Tieniu Tan

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively evaluate the proposed framework on various benchmarks, and quantitative and qualitative results show its substantial improvements and strong interpretability. ... We compare the proposed DTS-TPT with existing zero-shot video activity recognition methods on four benchmarks with 8 frames, as shown in Table 1. ... Our approach achieves state-of-the-art results on HMDB51, UCF101, Kinetics-600, and ActivityNet in terms of top-1 accuracy, surpassing BIKE by a significant margin. ... To illustrate the effectiveness of each component of our approach, we conduct ablation experiments on the HMDB51 and UCF101 datasets.
Researcher Affiliation | Academia | Rui Yan1, Hongyu Qu2, Xiangbo Shu2, Wenbin Li1, Jinhui Tang2 and Tieniu Tan1; 1Nanjing University, 2Nanjing University of Science and Technology; {ruiyan, liwenbin, tnt}@nju.edu.cn, {quhongyu, shuxb, jinhuitang}@njust.edu.cn
Pseudocode | Yes | Algorithm 1: Pseudocode of DTS-TPT
Open Source Code | Yes | The code is available at https://github.com/quhongyu/DTS-TPT.
Open Datasets | Yes | HMDB-51 [Kuehne et al., 2011] ... UCF-101 [Soomro et al., 2012] ... Kinetics-600 [Carreira et al., 2018] ... ActivityNet [Caba Heilbron et al., 2015]
Dataset Splits | Yes | Following [Ni et al., 2022; Rasheed et al., 2023; Lin et al., 2023], we report the mean and standard deviation of results on three official validation sets.
Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, or memory amounts) are mentioned for running the experiments.
Software Dependencies | No | The paper mentions software components such as 'CLIP with ViT-B/16', the 'AdamW optimizer', and LLMs (e.g., 'GPT-3.5'), but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We sample T = 8 frames from the test video and augment them K − 1 (K = 32) times using AugMix [Hendrycks et al., 2019]. We create 3 learnable tokens as text prompts and initialize them as "an action of". ... For each inference, we compute predictions based on a batch of 32 augmented views (including the original one) and then select the top 20% confident predictions for further optimization. We adopt the AdamW optimizer with a learning rate of 0.001. (A sketch of how this tuning step fits together follows below.)
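
Below is a minimal PyTorch-style sketch of the confidence-selected, entropy-minimization tuning step that the quoted setup describes (K = 32 augmented views of T = 8 frames, keep the top 20% most confident views, optimize the 3 learnable prompt tokens with AdamW at learning rate 0.001). The `classify` callable and `prompt_tokens` handle are hypothetical placeholders for the frozen CLIP ViT-B/16 video encoder and text encoder with the learnable prompt; the marginal-entropy loss follows standard test-time prompt tuning practice. This is an illustrative reconstruction of the reported hyper-parameters, not the authors' full DTS-TPT implementation.

```python
import math
import torch


def marginal_entropy(logits):
    """Entropy of the class distribution averaged over the selected views."""
    log_probs = logits.log_softmax(dim=-1)                        # (k, C) per-view log-probabilities
    avg_log_probs = torch.logsumexp(log_probs, dim=0) - math.log(log_probs.size(0))
    return -(avg_log_probs.exp() * avg_log_probs).sum()           # H of the averaged distribution


def test_time_prompt_tuning(classify, prompt_tokens, views,
                            top_ratio=0.2, lr=1e-3, steps=1):
    """Tune the learnable text-prompt tokens on a single test video.

    classify(views, prompt_tokens) -> (K, num_classes) logits; a stand-in for the
    frozen CLIP ViT-B/16 video encoder plus text encoder fed with the prompt.
    prompt_tokens: leaf tensor with requires_grad=True (the 3 learnable tokens).
    views: (K, T, C, H, W) tensor, K = 32 AugMix-augmented clips of T = 8 frames,
           with views[0] assumed to be the original, un-augmented clip.
    """
    optimizer = torch.optim.AdamW([prompt_tokens], lr=lr)
    for _ in range(steps):
        logits = classify(views, prompt_tokens)                   # (K, num_classes)
        probs = logits.softmax(dim=-1)
        view_entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        k = max(1, int(top_ratio * logits.size(0)))               # top 20% confident views
        keep = view_entropy.topk(k, largest=False).indices        # lowest entropy = most confident
        loss = marginal_entropy(logits[keep])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():                                         # predict with the tuned prompt
        return classify(views[:1], prompt_tokens).softmax(dim=-1).argmax(dim=-1)
```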