Parameter Efficient Multimodal Transformers for Video Representation Learning
Authors: Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale Song
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that our approach reduces parameters of the Transformers up to 97%, allowing us to train our model end-to-end from scratch. We also propose a negative sampling approach based on an instance similarity measured on the CNN embedding space that our model learns together with the Transformers. To demonstrate our approach, we pretrain our model on 30-second clips (480 frames) from Kinetics-700 and transfer it to audio-visual classification tasks. (A hedged sketch of this similarity-based negative sampling appears below the table.) |
| Researcher Affiliation | Collaboration | Sangho Lee, Youngjae Yu, Gunhee Kim Seoul National University {sangho.lee,yj.yu}@vision.snu.ac.kr, gunhee@snu.ac.kr Thomas Breuel, Jan Kautz NVIDIA Research {tbreuel,jkautz}@nvidia.com Yale Song Microsoft Research yalesong@microsoft.com |
| Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements or links indicating the release of open-source code for the described methodology. |
| Open Datasets | Yes | To demonstrate our approach, we pretrain our model on 30-second clips (480 frames) from Kinetics-700 and transfer it to audio-visual classification tasks. |
| Dataset Splits | No | The paper uses well-known datasets like Kinetics-700 and Audio Set, but it does not explicitly provide the specific training, validation, and test dataset splits used for its experiments. |
| Hardware Specification | Yes | We pretrain our model on 64 NVIDIA Tesla V100 GPUs with a batch size of 256 for 220K iterations. |
| Software Dependencies | No | In all experiments, we use the AMSGrad (Reddi et al., 2018) variant of AdamW (Loshchilov & Hutter, 2019) optimizer with β1 = 0.9, β2 = 0.98, L2 weight decay of 1e-4. We augment audio data with random frequency/time masking using SpecAugment (Park et al., 2019). (A hedged optimizer/augmentation sketch appears below the table.) |
| Experiment Setup | Yes | We set the number of layers L = 6, the number of attention heads A = 12, the feature dimension D = 768 and the intermediate dimension E = 3072. (A size-only architecture sketch appears below the table.) |
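
The Research Type row quotes the paper's negative sampling approach, which draws negatives using an instance similarity measured in the CNN embedding space. The paper provides no code, so the following is a minimal sketch of one plausible reading, assuming negatives for each anchor are sampled with probability proportional to cosine similarity between CNN embeddings; the function and variable names are hypothetical and the paper's exact sampling rule may differ.

```python
import torch
import torch.nn.functional as F

def sample_negative_indices(cnn_embeddings: torch.Tensor, num_negatives: int) -> torch.Tensor:
    """Hypothetical similarity-based negative sampling (not the paper's verbatim rule).

    cnn_embeddings: (B, D) clip-level embeddings from the CNN encoder.
    Returns an index tensor of shape (B, num_negatives) that, for each anchor,
    selects other clips in the batch with probability proportional to their
    cosine similarity to the anchor, so harder negatives are picked more often.
    """
    z = F.normalize(cnn_embeddings, dim=-1)   # unit-norm embeddings
    sim = z @ z.t()                           # (B, B) cosine similarities
    sim.fill_diagonal_(float("-inf"))         # never sample the anchor itself
    probs = torch.softmax(sim, dim=-1)        # similarity -> sampling distribution
    return torch.multinomial(probs, num_negatives, replacement=False)

# Example: 8 clips with 768-d embeddings, 3 negatives per anchor.
emb = torch.randn(8, 768)
neg_idx = sample_negative_indices(emb, num_negatives=3)
print(neg_idx.shape)  # torch.Size([8, 3])
```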
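
The Software Dependencies row lists the optimizer and audio-augmentation settings but no framework or versions. Assuming a PyTorch/torchaudio implementation (an assumption, since the paper does not name its stack), the reported hyperparameters could be wired up roughly as follows; the model, mask widths, and input shape are placeholders rather than values from the paper.

```python
import torch
import torchaudio

model = torch.nn.Linear(768, 768)  # placeholder for the actual multimodal model

# AMSGrad variant of AdamW with the betas and weight decay reported in the paper;
# the learning rate is not given in this section, so PyTorch's default is kept.
optimizer = torch.optim.AdamW(
    model.parameters(),
    betas=(0.9, 0.98),
    weight_decay=1e-4,
    amsgrad=True,
)

# SpecAugment-style random frequency/time masking on the audio spectrograms;
# the mask widths below are illustrative, not values from the paper.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=40)

spectrogram = torch.randn(1, 80, 480)  # (channel, mel bins, frames), dummy input
augmented = time_mask(freq_mask(spectrogram))
```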
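
The Experiment Setup row gives only the Transformer sizes (L = 6 layers, A = 12 attention heads, D = 768, E = 3072). The paper's model additionally shares parameters across layers and modalities, which a stock encoder does not capture, so the snippet below is a size reference built from `torch.nn.TransformerEncoder` under that caveat, not the paper's architecture.

```python
import torch.nn as nn

L, A, D, E = 6, 12, 768, 3072  # layers, attention heads, feature dim, intermediate dim

encoder_layer = nn.TransformerEncoderLayer(d_model=D, nhead=A, dim_feedforward=E)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=L)

# Rough parameter count for a non-shared encoder of this size; the paper's
# parameter-sharing scheme reduces Transformer parameters by up to 97%.
num_params = sum(p.numel() for p in encoder.parameters())
print(f"{num_params / 1e6:.1f}M parameters")
```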