DTS-TPT: Dual Temporal-Sync Test-time Prompt Tuning for Zero-shot Activity Recognition

Authors: Rui Yan, Hongyu Qu, Xiangbo Shu, Wenbin Li, Jinhui Tang, Tieniu Tan

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We extensively evaluate the proposed framework on various benchmarks, and quantitative and qualitative results show its substantial improvements and strong interpretability. ... We compare the proposed DTS-TPT with existing zero-shot video activity recognition methods on four benchmarks with 8 frames, as shown in Table 1. ... Our approach achieves state-of-the-art results on HMDB51, UCF101, Kinetics-600, and ActivityNet in terms of top-1 accuracy, surpassing BIKE by a significant margin. ... To illustrate the effectiveness of each component of our approach, we conduct ablation experiments on the HMDB51 and UCF101 datasets.
Researcher Affiliation | Academia | Rui Yan1, Hongyu Qu2, Xiangbo Shu2, Wenbin Li1, Jinhui Tang2 and Tieniu Tan1; 1Nanjing University, 2Nanjing University of Science and Technology; {ruiyan, liwenbin, tnt}@nju.edu.cn, {quhongyu, shuxb, jinhuitang}@njust.edu.cn
Pseudocode | Yes | Algorithm 1: Pseudocode of DTS-TPT
Open Source Code | Yes | The code is available at https://github.com/quhongyu/DTS-TPT.
Open Datasets | Yes | HMDB-51 [Kuehne et al., 2011] ... UCF-101 [Soomro et al., 2012] ... Kinetics-600 [Carreira et al., 2018] ... ActivityNet [Caba Heilbron et al., 2015]
Dataset Splits | Yes | Following [Ni et al., 2022; Rasheed et al., 2023; Lin et al., 2023], we report the mean and standard deviation of results on three official validation sets.
Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, or memory amounts) are mentioned for running the experiments.
Software Dependencies | No | The paper mentions software components such as 'CLIP with ViT-B/16', the 'AdamW optimizer', and LLMs (e.g., 'GPT-3.5'), but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We sample T = 8 frames from the test video and augment them K − 1 (K = 32) times using AugMix [Hendrycks et al., 2019]. We create 3 learnable tokens as text prompts and initialize them as "an action of". ... For each inference, we compute predictions based on a batch of 32 augmented views (including the original one) and then select the top 20% confident predictions for further optimization. We adopt the AdamW optimizer with a learning rate of 0.001. (A sketch of how this tuning step fits together follows below.)
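
Below is a minimal PyTorch-style sketch of the confidence-selected, entropy-minimization tuning step that the quoted setup describes (K = 32 augmented views of T = 8 frames, keep the top 20% most confident views, optimize the 3 learnable prompt tokens with AdamW at learning rate 0.001). The `classify` callable and `prompt_tokens` handle are hypothetical placeholders for the frozen CLIP ViT-B/16 video encoder and text encoder with the learnable prompt; the marginal-entropy loss follows standard test-time prompt tuning practice. This is an illustrative reconstruction of the reported hyper-parameters, not the authors' full DTS-TPT implementation.

```python
import math
import torch


def marginal_entropy(logits):
    """Entropy of the class distribution averaged over the selected views."""
    log_probs = logits.log_softmax(dim=-1)                        # (k, C) per-view log-probabilities
    avg_log_probs = torch.logsumexp(log_probs, dim=0) - math.log(log_probs.size(0))
    return -(avg_log_probs.exp() * avg_log_probs).sum()           # H of the averaged distribution


def test_time_prompt_tuning(classify, prompt_tokens, views,
                            top_ratio=0.2, lr=1e-3, steps=1):
    """Tune the learnable text-prompt tokens on a single test video.

    classify(views, prompt_tokens) -> (K, num_classes) logits; a stand-in for the
    frozen CLIP ViT-B/16 video encoder plus text encoder fed with the prompt.
    prompt_tokens: leaf tensor with requires_grad=True (the 3 learnable tokens).
    views: (K, T, C, H, W) tensor, K = 32 AugMix-augmented clips of T = 8 frames,
           with views[0] assumed to be the original, un-augmented clip.
    """
    optimizer = torch.optim.AdamW([prompt_tokens], lr=lr)
    for _ in range(steps):
        logits = classify(views, prompt_tokens)                   # (K, num_classes)
        probs = logits.softmax(dim=-1)
        view_entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
        k = max(1, int(top_ratio * logits.size(0)))               # top 20% confident views
        keep = view_entropy.topk(k, largest=False).indices        # lowest entropy = most confident
        loss = marginal_entropy(logits[keep])
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    with torch.no_grad():                                         # predict with the tuned prompt
        return classify(views[:1], prompt_tokens).softmax(dim=-1).argmax(dim=-1)
```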