ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning
Authors: Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, Hongsheng Li
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on video action recognition tasks show that our ST-Adapter can match or even outperform the strong full fine-tuning strategy and state-of-the-art video models, whilst enjoying the advantage of parameter efficiency. |
| Researcher Affiliation | Collaboration | (1) Multimedia Laboratory, The Chinese University of Hong Kong; (2) Surrey Institute for People-Centred Artificial Intelligence, CVSSP, University of Surrey; (3) Centre for Perceptual and Interactive Intelligence Limited |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and model are available at https://github.com/linziyi96/st-adapter |
| Open Datasets | Yes | Datasets For the benchmark experiments, we use two popular video action recognition datasets. Kinetics-400 (K400): The K400 [33] dataset contains 240k training videos and 20k validation videos labeled with 400 action categories. Something-Something-v2 (SSv2): The SSv2 [22] dataset consists of 220,487 videos covering 174 human actions. Epic-Kitchens-100 (EK100): The EK100 [13] dataset consists of 100 hours of video in egocentric perspective recording a person interacting with a variety of objects in the kitchen. |
| Dataset Splits | Yes | Kinetics-400 (K400): The K400 [33] dataset contains 240k training videos and 20k validation videos labeled with 400 action categories. |
| Hardware Specification | Yes | 8 V100 GPUs |
| Software Dependencies | No | The paper mentions "PyTorch, TensorFlow, TensorRT, and TorchScript" as deep learning toolboxes but does not provide specific version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | All details, including training and testing settings and module instantiation details, are provided in the appendix. (Section 4.1) We use one ST-Adapter with bottleneck width 384 before MHSA in each Transformer block. (Section 4.3 Ablations) All models are trained using 8 frames and tested with 3 views. (Table 1) |
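
The experiment-setup row above quotes the module placement reported in the paper: one ST-Adapter with bottleneck width 384, inserted before MHSA in each Transformer block, with models trained on 8 frames. Below is a minimal PyTorch sketch of that adapter layout, assuming the paper's residual bottleneck form (down-projection, depthwise 3D convolution over the spatio-temporal grid, activation, up-projection). The class and argument names (`STAdapter`, `bottleneck_width`, `kernel_size`) are illustrative and not taken from the authors' released code at https://github.com/linziyi96/st-adapter.

```python
# Minimal sketch of an ST-Adapter block (illustrative, not the authors'
# released implementation). Patch tokens are assumed to arrive flattened
# as (batch, frames * height * width, channels).
import torch
import torch.nn as nn


class STAdapter(nn.Module):
    def __init__(self, dim: int = 768, bottleneck_width: int = 384,
                 kernel_size: tuple = (3, 3, 3)):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_width)           # f_down
        # Depthwise 3D convolution over (T, H, W) for spatio-temporal reasoning.
        self.dwconv = nn.Conv3d(bottleneck_width, bottleneck_width,
                                kernel_size=kernel_size,
                                padding=tuple(k // 2 for k in kernel_size),
                                groups=bottleneck_width)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_width, dim)             # f_up

    def forward(self, x: torch.Tensor, t: int, h: int, w: int) -> torch.Tensor:
        # x: (B, T*H*W, C) flattened patch tokens (the CLS token, if any,
        # would be handled separately by the surrounding block).
        b, n, c = x.shape
        z = self.down(x)
        z = z.view(b, t, h, w, -1).permute(0, 4, 1, 2, 3)      # (B, C', T, H, W)
        z = self.dwconv(z)
        z = z.permute(0, 2, 3, 4, 1).reshape(b, n, -1)         # back to (B, N, C')
        z = self.up(self.act(z))
        return x + z                                            # residual connection


# Hypothetical usage inside a frozen ViT block, one adapter before MHSA:
#   x = adapter(x, t=8, h=14, w=14)
#   x = x + attn(norm1(x))
#   x = x + mlp(norm2(x))
```

Only the adapter parameters (plus the classifier head) would be trained in this setup; the pretrained image-backbone weights stay frozen, which is what gives the method its parameter efficiency.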