TriBERT: Human-centric Audio-visual Representation Learning

Authors: Tanzila Rahman, Mengyu Yang, Leonid Sigal

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We pre-train our model on the large MUSIC21 dataset and demonstrate improved performance in audio-visual sound source separation on that dataset as well as other datasets through fine-tuning. In addition, we show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks such as cross-modal audio-visual-pose retrieval by as much as 66.7% in top-1 accuracy.
Researcher Affiliation | Collaboration | University of British Columbia, University of Toronto, Vector Institute for AI, Canada CIFAR AI Chair
Pseudocode | No | The paper describes the architecture and methods in prose and figures, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We will make our code and pretrained models publicly available upon acceptance of the paper. https://github.com/ubc-vision/TriBERT
Open Datasets | Yes | We pre-train our model on the large MUSIC21 dataset and demonstrate improved performance in audio-visual sound source separation on that dataset as well as other datasets through fine-tuning. We use the MUSIC21 dataset [56] to train our network on two pretraining tasks: classification and sound source separation. For fine-tuning, we use the MUSIC dataset [55], which is a subset of MUSIC21, containing 685 untrimmed videos of musical solos and duets from 11 instrument classes.
Dataset Splits | Yes | We consider the MUSIC21 dataset [56], which contains 1365 untrimmed videos of musical solos and duets from 21 instrument classes for the initial training of our TriBERT architecture. For fine-tuning, we use the MUSIC dataset [55], which is a subset of MUSIC21, containing 685 untrimmed videos of musical solos and duets from 11 instrument classes. As a result, for fair comparison, we trained our baselines [15, 55] with the available videos using an 80/20 train/test split. For the MUSIC dataset, we follow the experimental protocol from [42] and consider their reported results as our baselines.
Hardware Specification | Yes | We use the Adam optimizer with an initial learning rate of 1e-5 and batch size of 12 to train the network on 4 GTX 1080 GPUs for 6k epochs.
Software Dependencies | No | We used PyTorch to implement our network. The paper mentions PyTorch but does not specify its version number or any other software dependencies with versions.
Experiment Setup | Yes | We use the Adam optimizer with an initial learning rate of 1e-5 and batch size of 12 to train the network on 4 GTX 1080 GPUs for 6k epochs. Training takes approximately 192 hours. We follow a fine-tuning strategy where we modify the classification layer from each pre-trained stream and then train our proposed model end-to-end with a learning rate of 1e-7 for 1500 epochs while keeping the rest of the hyper-parameters the same as the initial task.
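The Dataset Splits, Hardware Specification, and Experiment Setup rows together quote an 80/20 train/test split, Adam with a 1e-5 pre-training learning rate, batch size 12, and a 1e-7 fine-tuning learning rate. The snippet below is a minimal PyTorch sketch of that configuration only, assuming a generic model and dataset; the stand-in model, the synthetic data, and the shortened loop lengths are hypothetical placeholders, not the authors' released TriBERT code.

# Minimal sketch of the quoted training/fine-tuning configuration (assumptions noted in comments).
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset, random_split

# Hypothetical stand-ins for the TriBERT network and the MUSIC21 videos.
model = nn.Linear(128, 21)                                   # placeholder, 21 instrument classes
full_set = TensorDataset(torch.randn(100, 128), torch.randint(0, 21, (100,)))

# 80/20 train/test split, as described for the baseline comparison.
n_train = int(0.8 * len(full_set))
train_set, test_set = random_split(full_set, [n_train, len(full_set) - n_train])
loader = DataLoader(train_set, batch_size=12, shuffle=True)  # batch size 12, as quoted

# Pre-training: Adam with an initial learning rate of 1e-5.
optimizer = optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()                            # placeholder objective for the sketch

for epoch in range(2):                                       # paper reports ~6k epochs; shortened here
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

# Fine-tuning: same hyper-parameters, but the learning rate drops to 1e-7.
for group in optimizer.param_groups:
    group["lr"] = 1e-7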