ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning

Authors: Junting Pan, Ziyi Lin, Xiatian Zhu, Jing Shao, Hongsheng Li

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on video action recognition tasks show that our ST-Adapter can match or even outperform the strong full fine-tuning strategy and state-of-the-art video models, whilst enjoying the advantage of parameter efficiency.
Researcher Affiliation | Collaboration | (1) Multimedia Laboratory, The Chinese University of Hong Kong; (2) Surrey Institute for People-Centred Artificial Intelligence, CVSSP, University of Surrey; (3) Centre for Perceptual and Interactive Intelligence Limited
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and model are available at https://github.com/linziyi96/st-adapter
Open Datasets | Yes | Datasets: For the benchmark experiments, we use three popular video action recognition datasets. Kinetics-400 (K400): The K400 [33] dataset contains 240k training videos and 20k validation videos labeled with 400 action categories. Something-Something-v2 (SSv2): The SSv2 [22] dataset consists of 220,487 videos covering 174 human actions. Epic-Kitchens-100 (EK100): The EK100 [13] dataset consists of 100 hours of egocentric video recording a person interacting with a variety of objects in the kitchen.
Dataset Splits | Yes | Kinetics-400 (K400): The K400 [33] dataset contains 240k training videos and 20k validation videos labeled with 400 action categories.
Hardware Specification | Yes | 8 V100 GPUs
Software Dependencies | No | The paper mentions "PyTorch, TensorFlow, TensorRT, and TorchScript" as deep learning toolboxes but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | All details, including training and testing settings and module instantiation details, are provided in the appendix. (Section 4.1) We use one ST-Adapter with bottleneck width 384 before MHSA in each Transformer block. (Section 4.3, Ablations) All models are trained using 8 frames and tested with 3 views. (Table 1)
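
The setup row above describes the ST-Adapter as a bottleneck module of width 384 inserted before MHSA in each Transformer block. Below is a minimal PyTorch sketch of such a module: down-project, apply a depthwise 3D convolution over time and space, activate, up-project, and add residually. The bottleneck width matches the quoted setting, but the kernel size, GELU activation, token reshaping, and omitted class-token handling are illustrative assumptions rather than the authors' exact implementation; see the linked repository for that.

```python
import torch
import torch.nn as nn


class STAdapter(nn.Module):
    """Sketch of an ST-Adapter bottleneck: down-project -> depthwise
    3D conv over (T, H, W) -> GELU -> up-project, with a residual add.
    Kernel size and activation are assumptions, not paper-verbatim."""

    def __init__(self, dim: int = 768, bottleneck: int = 384,
                 kernel: tuple = (3, 3, 3)):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        # Depthwise spatio-temporal convolution (groups == channels).
        self.conv = nn.Conv3d(bottleneck, bottleneck, kernel,
                              padding=tuple(k // 2 for k in kernel),
                              groups=bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor, t: int, h: int, w: int) -> torch.Tensor:
        # x: (batch, t*h*w, dim) patch tokens from a frozen ViT block
        # (class-token handling omitted for brevity).
        b, n, _ = x.shape
        z = self.down(x)                                    # (B, N, bottleneck)
        z = z.view(b, t, h, w, -1).permute(0, 4, 1, 2, 3)   # (B, C, T, H, W)
        z = self.conv(z)                                    # mix across T, H, W
        z = z.permute(0, 2, 3, 4, 1).reshape(b, n, -1)      # back to tokens
        return x + self.up(self.act(z))                     # residual add


# Example: 8 frames of 14x14 patch tokens, matching the 8-frame setting
# quoted above (batch size and embedding width are arbitrary here).
tokens = torch.randn(2, 8 * 14 * 14, 768)
out = STAdapter()(tokens, t=8, h=14, w=14)
assert out.shape == tokens.shape
```

Only the adapter parameters would be trained in this scheme; the surrounding image-pretrained ViT weights stay frozen, which is the source of the parameter efficiency the paper reports.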