Boosting the Transferability of Video Adversarial Examples via Temporal Translation

Authors: Zhipeng Wei, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang

AAAI 2022, pp. 2659-2667 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on the Kinetics-400 dataset and the UCF-101 dataset demonstrate that our method can significantly boost the transferability of video adversarial examples. For transfer-based attacks against video recognition models, it achieves a 61.56% average attack success rate on the Kinetics-400 and 48.60% on the UCF-101.
Researcher Affiliation | Academia | 1. Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University; 2. Shanghai Collaborative Innovation Center on Intelligent Visual Computing. Emails: zpwei21@m.fudan.edu.cn, {chenjingjing, zxwu, ygj}@fudan.edu.cn
Pseudocode | Yes | Algorithm 1: Temporal translation (TT) attack.
    Input: Loss function J, clean video x, ground-truth class y.
    Parameters: the perturbation budget ε, iteration number I, shift length L, weight matrix W.
    Output: the adversarial example x_I.
    1: x_0 ← x
    2: α ← ε / I
    3: for i = 0 to I − 1 do
    4:     x_{i+1} = clip_{x,ε}(x_i + α · sign(g))
    5: end for
    6: return x_I
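For clarity, below is a minimal, hypothetical PyTorch sketch of the loop quoted above. It assumes a video model `model` taking clips of shape (batch, channels, frames, height, width) with pixel values scaled to [0, 1] (so ε = 16 becomes 16/255), and it realizes the temporal shift with a simple circular roll; the helper names (`gaussian_weights`, `temporal_shift`, `tt_attack`) are illustrative and not taken from the authors' released code. Here g in step 4 denotes the gradient aggregated over the temporally translated clips, weighted by the Gaussian matrix W, as the method describes.

```python
import torch
import torch.nn.functional as F


def gaussian_weights(shift, sigma=3.0):
    """Gaussian weights w_l for temporal shifts l = -L, ..., L (the matrix W)."""
    offsets = torch.arange(-shift, shift + 1, dtype=torch.float32)
    w = torch.exp(-offsets ** 2 / (2 * sigma ** 2))
    return w / w.sum()


def temporal_shift(x, l):
    """Circular shift of the clip by l positions along the frame axis (dim 2)."""
    return torch.roll(x, shifts=l, dims=2)


def tt_attack(model, x, y, eps=16 / 255, iters=10, shift=7):
    """Sketch of Algorithm 1: iterative signed updates driven by gradients
    aggregated over temporally translated clips."""
    alpha = eps / iters                      # step size alpha = eps / I
    weights = gaussian_weights(shift)        # weight matrix W
    x_adv = x.clone().detach()
    for _ in range(iters):
        grad = torch.zeros_like(x_adv)
        for w, l in zip(weights, range(-shift, shift + 1)):
            x_l = temporal_shift(x_adv, l).requires_grad_(True)
            loss = F.cross_entropy(model(x_l), y)
            g_l, = torch.autograd.grad(loss, x_l)
            # Shift the gradient back so it aligns with the untranslated clip
            # (an assumption about the aggregation; see the paper for the exact form).
            grad += w * temporal_shift(g_l, -l)
        x_adv = x_adv + alpha * grad.sign()  # step 4: signed ascent
        # clip_{x,eps}: project back into the eps-ball around x and the valid range.
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1).detach()
    return x_adv
```

The authors' repository (https://github.com/zhipeng-wei/TT) is the authoritative implementation, including the exact adjacent/remote shift and weight-matrix definitions.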
Open Source Code | Yes | Code is available at https://github.com/zhipeng-wei/TT.
Open Datasets | Yes | We evaluate our approach using UCF-101 (Soomro, Zamir, and Shah 2012) and Kinetics-400 (Kay et al. 2017), which are widely used datasets for video recognition.
Dataset Splits | No | The paper mentions using 'the Kinetics-400 validation dataset' for evaluation, but it does not specify the explicit proportions (e.g., percentages or counts) of the training, validation, and test splits for the datasets (UCF-101 and Kinetics-400) used to train the models, nor does it cite a standard split.
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running its experiments. It only mentions that models were 'trained on the RGB domain'.
Software Dependencies | No | The paper mentions that the models used are 'implemented in https://cv.gluon.ai/model_zoo/action_recognition.html', implying the use of GluonCV. However, it does not provide specific version numbers for any software dependencies, such as Python, PyTorch/MXNet, or GluonCV itself.
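As context for this dependency, a minimal sketch of loading one of the referenced pretrained video models from the GluonCV (MXNet) model zoo is shown below; the exact model-name string and the availability of a ResNet-101 backbone are assumptions and should be checked against the model zoo page linked above.

```python
# Hypothetical loading of a pretrained Kinetics-400 model from the GluonCV
# model zoo (MXNet backend assumed; the model name is an assumption).
from gluoncv.model_zoo import get_model

net = get_model('i3d_resnet101_v1_kinetics400', pretrained=True)
net.hybridize()  # optional: compile the symbolic graph for faster inference
```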
Experiment Setup | Yes | In our experiments, video recognition models with ResNet-101 as the backbone are used as white-box models for adversarial example generation. We set the maximum perturbation to ε = 16 for all experiments. For the iterative attack, we set the iteration number to I = 10, and thus the step size α = 1.6. For our method, the shift length L is set to 7, the weight matrix W is generated with a Gaussian function, and the adjacent shift is adopted in the temporal translation. Input clips are formed by randomly cropping 64 consecutive frames from videos and then skipping every other frame. The spatial size of the input is 224 × 224.
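The stated hyperparameters can be summarized in a small, hypothetical helper; the function names below are illustrative only, and the sampling sketch assumes each source video has at least 64 frames.

```python
# Illustrative summary of the reported setup: eps = 16 (0-255 pixel scale),
# I = 10 iterations so alpha = eps / I = 1.6, shift length L = 7, and clips
# built from 64 consecutive frames with every other frame kept, at 224 x 224.
import random

import numpy as np


def attack_config():
    eps, iters, shift = 16.0, 10, 7
    return {
        "eps": eps,            # maximum L_inf perturbation
        "iters": iters,        # iteration number I
        "alpha": eps / iters,  # step size alpha = 1.6
        "shift": shift,        # shift length L (2L + 1 = 15 translated clips)
        "crop_size": 224,      # spatial input size
    }


def sample_clip_indices(num_video_frames):
    """Randomly crop 64 consecutive frames, then skip every other frame."""
    start = random.randint(0, num_video_frames - 64)
    return np.arange(start, start + 64, 2)  # 32 frame indices


print(attack_config())
print(sample_clip_indices(300))  # e.g. for a 300-frame video
```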