ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning

Authors: Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, Hongsheng Li

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on video action recognition tasks show that our ST-Adapter can match or even outperform the strong full fine-tuning strategy and state-of-the-art video models, whilst enjoying the advantage of parameter efficiency.
Researcher Affiliation | Collaboration | (1) Multimedia Laboratory, The Chinese University of Hong Kong; (2) Surrey Institute for People-Centred Artificial Intelligence, CVSSP, University of Surrey; (3) Centre for Perceptual and Interactive Intelligence Limited
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and model are available at https://github.com/linziyi96/st-adapter
Open Datasets | Yes | Datasets: For the benchmark experiments, we use three popular video action recognition datasets. Kinetics-400 (K400): The K400 [33] dataset contains 240k training videos and 20k validation videos labeled with 400 action categories. Something-Something-v2 (SSv2): The SSv2 [22] dataset consists of 220,487 videos covering 174 human actions. Epic-Kitchens-100 (EK100): The EK100 [13] dataset consists of 100 hours of egocentric video recording a person interacting with a variety of objects in the kitchen.
Dataset Splits | Yes | Kinetics-400 (K400): The K400 [33] dataset contains 240k training videos and 20k validation videos labeled with 400 action categories.
Hardware Specification | Yes | 8 V100 GPUs
Software Dependencies | No | The paper mentions "PyTorch, TensorFlow, TensorRT, and TorchScript" as deep learning toolboxes but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | All details, including training and testing settings and module instantiation details, are provided in the appendix. (Section 4.1) We use one ST-Adapter with bottleneck width 384 before MHSA in each Transformer block. (Section 4.3, Ablations) All models are trained using 8 frames and tested with 3 views. (Table 1)
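
The setup row above describes the ST-Adapter as a bottleneck module of width 384 inserted before MHSA in each Transformer block. Below is a minimal PyTorch sketch of such a module: down-project, apply a depthwise 3D convolution over time and space, activate, up-project, and add residually. The bottleneck width matches the quoted setting, but the kernel size, GELU activation, token reshaping, and omitted class-token handling are illustrative assumptions rather than the authors' exact implementation; see the linked repository for that.

```python
import torch
import torch.nn as nn


class STAdapter(nn.Module):
    """Sketch of an ST-Adapter bottleneck: down-project -> depthwise
    3D conv over (T, H, W) -> GELU -> up-project, with a residual add.
    Kernel size and activation are assumptions, not paper-verbatim."""

    def __init__(self, dim: int = 768, bottleneck: int = 384,
                 kernel: tuple = (3, 3, 3)):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        # Depthwise spatio-temporal convolution (groups == channels).
        self.conv = nn.Conv3d(bottleneck, bottleneck, kernel,
                              padding=tuple(k // 2 for k in kernel),
                              groups=bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor, t: int, h: int, w: int) -> torch.Tensor:
        # x: (batch, t*h*w, dim) patch tokens from a frozen ViT block
        # (class-token handling omitted for brevity).
        b, n, _ = x.shape
        z = self.down(x)                                    # (B, N, bottleneck)
        z = z.view(b, t, h, w, -1).permute(0, 4, 1, 2, 3)   # (B, C, T, H, W)
        z = self.conv(z)                                    # mix across T, H, W
        z = z.permute(0, 2, 3, 4, 1).reshape(b, n, -1)      # back to tokens
        return x + self.up(self.act(z))                     # residual add


# Example: 8 frames of 14x14 patch tokens, matching the 8-frame setting
# quoted above (batch size and embedding width are arbitrary here).
tokens = torch.randn(2, 8 * 14 * 14, 768)
out = STAdapter()(tokens, t=8, h=14, w=14)
assert out.shape == tokens.shape
```

Only the adapter parameters would be trained in this scheme; the surrounding image-pretrained ViT weights stay frozen, which is the source of the parameter efficiency the paper reports.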