Improve Video Representation with Temporal Adversarial Augmentation

Authors: Jinhao Duan, Quanfu Fan, Hao Cheng, Xiaoshuang Shi, Kaidi Xu

IJCAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate TAF with four powerful models (TSM, GST, TAM, and TPN) over three challenging temporal-related benchmarks (Something-Something V1 & V2 and Diving48). Experimental results demonstrate that TAF effectively improves the test accuracy of these models with notable margins without introducing additional parameters or computational costs.
Researcher Affiliation | Collaboration | Jinhao Duan (Drexel University), Quanfu Fan (Amazon), Hao Cheng (The Hong Kong University of Science and Technology (Guangzhou)), Xiaoshuang Shi (University of Electronic Science and Technology of China), and Kaidi Xu (Drexel University)
Pseudocode | Yes | The pseudo-code of TAF is shown in Appendix A.
Open Source Code | Yes | Code is available at https://github.com/jinhaoduan/TAF.
Open Datasets | Yes | We evaluate TAF on three popular temporal datasets: Something-Something V1 & V2 [Goyal et al., 2017] and Diving48 [Li et al., 2018b].
Dataset Splits | No | The paper mentions 'top-1 training accuracy vs. top-1 validation accuracy' and uses pre-trained models with their initial training settings, implying standard splits. However, it does not explicitly state the percentages or sample counts for the training, validation, and test splits used in its experiments.
Hardware Specification | No | The paper does not specify the hardware used for the experiments, such as GPU models, CPU types, or memory configurations; it only mentions 'computational overheads' in general terms.
Software Dependencies | No | The paper does not specify version numbers for any software dependencies, such as the programming languages, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | For fine-tuning, we load pre-trained weights and continue training for 15 epochs with TAF. We conduct 3 trials for each experiment and report the mean results. The initial training settings (e.g., learning rate, batch size, dropout) are the same as those used when the pre-trained models were logged. The learning rates are decayed by a factor of 10 after 10 epochs. We set α to 0.7 and the number of attacked frames N to 8 or 16, according to the input temporal length. All performances reported in this paper are evaluated on 1 center crop and 1 clip, with an input resolution of 224 × 224.
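The fine-tuning recipe quoted above translates into a fairly standard adversarial-augmentation loop. Below is a minimal PyTorch-style sketch, not the authors' implementation (that lives in the linked repository and Appendix A): `fgsm_on_frames`, `load_pretrained_video_model`, and `make_train_loader` are hypothetical stand-ins, the FGSM-style frame attack is a generic substitute for the actual TAF perturbation, and the way α mixes the clean and adversarial losses is an assumption rather than the paper's definition. Only the epoch count, LR decay factor, α value, N, and input resolution are taken from the Experiment Setup row.

```python
import torch
import torch.nn.functional as F

def fgsm_on_frames(model, clips, labels, n_frames, eps=2 / 255):
    """Generic FGSM-style stand-in for the TAF attack (Appendix A of the paper):
    perturb only the first `n_frames` frames of each clip. Assumes inputs in [0, 1]."""
    clips = clips.clone().detach().requires_grad_(True)
    grad = torch.autograd.grad(F.cross_entropy(model(clips), labels), clips)[0]
    adv = clips.detach().clone()
    adv[:, :n_frames] += eps * grad[:, :n_frames].sign()
    return adv.clamp(0, 1)

# Hypothetical loaders: swap in the TSM/GST/TAM/TPN checkpoints and dataloaders
# from the TAF repository (https://github.com/jinhaoduan/TAF).
model = load_pretrained_video_model()
train_loader = make_train_loader(resolution=224)   # clips shaped (B, T, C, 224, 224)

base_lr = 0.001  # placeholder; the paper reuses each pre-trained model's original setting
optimizer = torch.optim.SGD(model.parameters(), lr=base_lr, momentum=0.9)
# Decay the learning rate by a factor of 10 after 10 of the 15 fine-tuning epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[10], gamma=0.1)

alpha = 0.7       # value quoted in the Experiment Setup row
n_attacked = 8    # N = 8 or 16, matching the input temporal length

for epoch in range(15):
    for clips, labels in train_loader:
        adv_clips = fgsm_on_frames(model, clips, labels, n_frames=n_attacked)
        clean_loss = F.cross_entropy(model(clips), labels)
        adv_loss = F.cross_entropy(model(adv_clips), labels)
        # Assumed role of alpha: a convex mix of clean and adversarial losses.
        loss = alpha * clean_loss + (1 - alpha) * adv_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```

Evaluation would then use a single center crop and a single clip at 224 × 224, as reported, so any accuracy comparison against the published numbers should follow that protocol rather than multi-crop/multi-clip testing.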