Parameter Efficient Multimodal Transformers for Video Representation Learning

Authors: Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale Song

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that our approach reduces parameters of the Transformers up to 97%, allowing us to train our model end-to-end from scratch. We also propose a negative sampling approach based on an instance similarity measured on the CNN embedding space that our model learns together with the Transformers. To demonstrate our approach, we pretrain our model on 30-second clips (480 frames) from Kinetics-700 and transfer it to audio-visual classification tasks. (A hedged sketch of the similarity-based negative sampling appears below the table.)
Researcher Affiliation | Collaboration | Sangho Lee, Youngjae Yu, Gunhee Kim (Seoul National University) {sangho.lee,yj.yu}@vision.snu.ac.kr, gunhee@snu.ac.kr; Thomas Breuel, Jan Kautz (NVIDIA Research) {tbreuel,jkautz}@nvidia.com; Yale Song (Microsoft Research) yalesong@microsoft.com
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements or links indicating the release of open-source code for the described methodology.
Open Datasets | Yes | To demonstrate our approach, we pretrain our model on 30-second clips (480 frames) from Kinetics-700 and transfer it to audio-visual classification tasks.
Dataset Splits | No | The paper uses well-known datasets such as Kinetics-700 and AudioSet, but it does not explicitly provide the specific training, validation, and test splits used for its experiments.
Hardware Specification | Yes | We pretrain our model on 64 NVIDIA Tesla V100 GPUs with a batch size of 256 for 220K iterations.
Software Dependencies | No | The paper does not name specific software libraries or versions; it only describes optimizer and augmentation settings: In all experiments, we use the AMSGrad (Reddi et al., 2018) variant of AdamW (Loshchilov & Hutter, 2019) optimizer with β1 = 0.9, β2 = 0.98, L2 weight decay of 1e-4. We augment audio data with random frequency/time masking using SpecAugment (Park et al., 2019). (A hedged optimizer/augmentation sketch appears below the table.)
Experiment Setup | Yes | We set the number of layers L = 6, the number of attention heads A = 12, the feature dimension D = 768 and the intermediate dimension E = 3072. (A hedged encoder configuration sketch appears below the table.)
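
The negative sampling described in the Research Type row is only summarized in the quoted excerpt, so the following is a minimal sketch of one plausible reading: for each anchor, negatives are drawn with probability proportional to similarity in the CNN embedding space, so that more similar (harder) instances are sampled more often. The cosine measure, the softmax weighting, and all tensor shapes are assumptions, not details taken from the paper.

```python
# Hedged sketch: similarity-driven negative sampling in a CNN embedding space.
# The cosine similarity and softmax weighting are assumptions; the paper's exact
# sampling scheme may differ.
import torch
import torch.nn.functional as F

def sample_negatives(embeddings: torch.Tensor, anchor_idx: int, num_negatives: int) -> torch.Tensor:
    """Pick negative indices for one anchor, favouring similar (harder) instances."""
    emb = F.normalize(embeddings, dim=-1)      # unit-norm CNN embeddings
    sim = emb @ emb[anchor_idx]                # cosine similarity of every instance to the anchor
    sim[anchor_idx] = float("-inf")            # never sample the anchor itself
    probs = torch.softmax(sim, dim=0)          # higher similarity -> higher sampling probability
    return torch.multinomial(probs, num_negatives, replacement=False)

embeddings = torch.randn(256, 768)             # batch of 256 clip embeddings (illustrative shapes)
neg_idx = sample_negatives(embeddings, anchor_idx=0, num_negatives=8)
```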
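The Software Dependencies row quotes optimizer and augmentation settings without naming libraries, so the sketch below shows one way to realize them. The choice of PyTorch/torchaudio, the learning rate, the mask sizes, and the spectrogram shape are assumptions, not values from the paper.

```python
# Hedged sketch of the optimizer and audio augmentation settings quoted above.
import torch
import torchaudio

model = torch.nn.Linear(768, 768)  # placeholder for the actual audio-visual model

# AMSGrad variant of AdamW with beta1=0.9, beta2=0.98 and L2 weight decay of 1e-4,
# as quoted in the Software Dependencies row.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,               # assumption: the excerpt does not state the learning rate
    betas=(0.9, 0.98),
    weight_decay=1e-4,
    amsgrad=True,
)

# SpecAugment-style random frequency/time masking on a spectrogram.
# Mask parameters are illustrative defaults, not values from the paper.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=100)

spec = torch.randn(1, 80, 3000)    # (batch, mel bins, frames); shapes are illustrative
spec = time_mask(freq_mask(spec))
```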
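The Experiment Setup row gives the Transformer dimensions (L = 6, A = 12, D = 768, E = 3072). Below is a minimal sketch of an encoder with those sizes, using torch.nn.TransformerEncoder as a stand-in for the paper's multimodal Transformer; the stand-in module, the batch-first layout, and the 480-token sequence length are assumptions.

```python
# Hedged sketch of the quoted Transformer size: 6 layers, 12 heads,
# feature dimension 768, intermediate dimension 3072.
import torch
from torch import nn

layer = nn.TransformerEncoderLayer(
    d_model=768,           # D
    nhead=12,              # A
    dim_feedforward=3072,  # E
    batch_first=True,      # assumption: layout convention, not specified in the paper
)
encoder = nn.TransformerEncoder(layer, num_layers=6)  # L

tokens = torch.randn(2, 480, 768)  # e.g. 480 frame tokens for a 30-second clip (illustrative)
out = encoder(tokens)
print(out.shape)  # torch.Size([2, 480, 768])
```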