Parameter Efficient Multimodal Transformers for Video Representation Learning

Authors: Sangho Lee, Youngjae Yu, Gunhee Kim, Thomas Breuel, Jan Kautz, Yale Song

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that our approach reduces parameters of the Transformers up to 97%, allowing us to train our model end-to-end from scratch. We also propose a negative sampling approach based on an instance similarity measured on the CNN embedding space that our model learns together with the Transformers. To demonstrate our approach, we pretrain our model on 30-second clips (480 frames) from Kinetics-700 and transfer it to audio-visual classification tasks. (A hedged sketch of the similarity-based negative sampling appears below the table.)
Researcher Affiliation | Collaboration | Sangho Lee, Youngjae Yu, Gunhee Kim (Seoul National University) {sangho.lee,yj.yu}@vision.snu.ac.kr, gunhee@snu.ac.kr; Thomas Breuel, Jan Kautz (NVIDIA Research) {tbreuel,jkautz}@nvidia.com; Yale Song (Microsoft Research) yalesong@microsoft.com
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statements or links indicating the release of open-source code for the described methodology.
Open Datasets | Yes | To demonstrate our approach, we pretrain our model on 30-second clips (480 frames) from Kinetics-700 and transfer it to audio-visual classification tasks.
Dataset Splits | No | The paper uses well-known datasets such as Kinetics-700 and AudioSet, but it does not explicitly provide the specific training, validation, and test splits used for its experiments.
Hardware Specification | Yes | We pretrain our model on 64 NVIDIA Tesla V100 GPUs with a batch size of 256 for 220K iterations.
Software Dependencies | No | The paper does not name specific software libraries or versions; it only describes optimizer and augmentation settings: In all experiments, we use the AMSGrad (Reddi et al., 2018) variant of AdamW (Loshchilov & Hutter, 2019) optimizer with β1 = 0.9, β2 = 0.98, L2 weight decay of 1e-4. We augment audio data with random frequency/time masking using SpecAugment (Park et al., 2019). (A hedged optimizer/augmentation sketch appears below the table.)
Experiment Setup | Yes | We set the number of layers L = 6, the number of attention heads A = 12, the feature dimension D = 768 and the intermediate dimension E = 3072. (A hedged encoder configuration sketch appears below the table.)
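
The negative sampling described in the Research Type row is only summarized in the quoted excerpt, so the following is a minimal sketch of one plausible reading: for each anchor, negatives are drawn with probability proportional to similarity in the CNN embedding space, so that more similar (harder) instances are sampled more often. The cosine measure, the softmax weighting, and all tensor shapes are assumptions, not details taken from the paper.

```python
# Hedged sketch: similarity-driven negative sampling in a CNN embedding space.
# The cosine similarity and softmax weighting are assumptions; the paper's exact
# sampling scheme may differ.
import torch
import torch.nn.functional as F

def sample_negatives(embeddings: torch.Tensor, anchor_idx: int, num_negatives: int) -> torch.Tensor:
    """Pick negative indices for one anchor, favouring similar (harder) instances."""
    emb = F.normalize(embeddings, dim=-1)      # unit-norm CNN embeddings
    sim = emb @ emb[anchor_idx]                # cosine similarity of every instance to the anchor
    sim[anchor_idx] = float("-inf")            # never sample the anchor itself
    probs = torch.softmax(sim, dim=0)          # higher similarity -> higher sampling probability
    return torch.multinomial(probs, num_negatives, replacement=False)

embeddings = torch.randn(256, 768)             # batch of 256 clip embeddings (illustrative shapes)
neg_idx = sample_negatives(embeddings, anchor_idx=0, num_negatives=8)
```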
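The Software Dependencies row quotes optimizer and augmentation settings without naming libraries, so the sketch below shows one way to realize them. The choice of PyTorch/torchaudio, the learning rate, the mask sizes, and the spectrogram shape are assumptions, not values from the paper.

```python
# Hedged sketch of the optimizer and audio augmentation settings quoted above.
import torch
import torchaudio

model = torch.nn.Linear(768, 768)  # placeholder for the actual audio-visual model

# AMSGrad variant of AdamW with beta1=0.9, beta2=0.98 and L2 weight decay of 1e-4,
# as quoted in the Software Dependencies row.
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-4,               # assumption: the excerpt does not state the learning rate
    betas=(0.9, 0.98),
    weight_decay=1e-4,
    amsgrad=True,
)

# SpecAugment-style random frequency/time masking on a spectrogram.
# Mask parameters are illustrative defaults, not values from the paper.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=27)
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=100)

spec = torch.randn(1, 80, 3000)    # (batch, mel bins, frames); shapes are illustrative
spec = time_mask(freq_mask(spec))
```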
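The Experiment Setup row gives the Transformer dimensions (L = 6, A = 12, D = 768, E = 3072). Below is a minimal sketch of an encoder with those sizes, using torch.nn.TransformerEncoder as a stand-in for the paper's multimodal Transformer; the stand-in module, the batch-first layout, and the 480-token sequence length are assumptions.

```python
# Hedged sketch of the quoted Transformer size: 6 layers, 12 heads,
# feature dimension 768, intermediate dimension 3072.
import torch
from torch import nn

layer = nn.TransformerEncoderLayer(
    d_model=768,           # D
    nhead=12,              # A
    dim_feedforward=3072,  # E
    batch_first=True,      # assumption: layout convention, not specified in the paper
)
encoder = nn.TransformerEncoder(layer, num_layers=6)  # L

tokens = torch.randn(2, 480, 768)  # e.g. 480 frame tokens for a 30-second clip (illustrative)
out = encoder(tokens)
print(out.shape)  # torch.Size([2, 480, 768])
```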