TriBERT: Human-centric Audio-visual Representation Learning

Authors: Tanzila Rahman, Mengyu Yang, Leonid Sigal

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We pre-train our model on the large MUSIC21 dataset and demonstrate improved performance in audio-visual sound source separation on that dataset as well as other datasets through fine-tuning. In addition, we show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks such as cross-modal audio-visual-pose retrieval by as much as 66.7% in top-1 accuracy.
Researcher Affiliation | Collaboration | University of British Columbia, University of Toronto, Vector Institute for AI, Canada CIFAR AI Chair
Pseudocode | No | The paper describes the architecture and methods in prose and figures, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | We will make our code and pretrained models publicly available upon acceptance of the paper. https://github.com/ubc-vision/TriBERT
Open Datasets | Yes | We pre-train our model on the large MUSIC21 dataset and demonstrate improved performance in audio-visual sound source separation on that dataset as well as other datasets through fine-tuning. We use the MUSIC21 dataset [56] to train our network on two pretraining tasks: classification and sound source separation. For fine-tuning, we use the MUSIC dataset [55], which is a subset of MUSIC21, containing 685 untrimmed videos of musical solos and duets from 11 instrument classes.
Dataset Splits | Yes | We consider the MUSIC21 dataset [56], which contains 1365 untrimmed videos of musical solos and duets from 21 instrument classes for the initial training of our TriBERT architecture. For fine-tuning, we use the MUSIC dataset [55], which is a subset of MUSIC21, containing 685 untrimmed videos of musical solos and duets from 11 instrument classes. As a result, for fair comparison, we trained our baselines [15, 55] with the available videos using an 80/20 train/test split. For the MUSIC dataset, we follow the experimental protocol from [42] and consider their reported results as our baselines.
Hardware Specification | Yes | We use the Adam optimizer with an initial learning rate of 1e-5 and batch size of 12 to train the network on 4 GTX 1080 GPUs for 6k epochs.
Software Dependencies | No | We used PyTorch to implement our network. The paper mentions PyTorch but does not specify its version number or any other software dependencies with versions.
Experiment Setup | Yes | We use the Adam optimizer with an initial learning rate of 1e-5 and batch size of 12 to train the network on 4 GTX 1080 GPUs for 6k epochs. Training takes approximately 192 hours. We follow a fine-tuning strategy where we modify the classification layer from each pre-trained stream and then train our proposed model end-to-end with a learning rate of 1e-7 for 1500 epochs while keeping the rest of the hyper-parameters the same as the initial task.
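The Dataset Splits, Hardware Specification, and Experiment Setup rows together quote an 80/20 train/test split, Adam with a 1e-5 pre-training learning rate, batch size 12, and a 1e-7 fine-tuning learning rate. The snippet below is a minimal PyTorch sketch of that configuration only, assuming a generic model and dataset; the stand-in model, the synthetic data, and the shortened loop lengths are hypothetical placeholders, not the authors' released TriBERT code.

# Minimal sketch of the quoted training/fine-tuning configuration (assumptions noted in comments).
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset, random_split

# Hypothetical stand-ins for the TriBERT network and the MUSIC21 videos.
model = nn.Linear(128, 21)                                   # placeholder, 21 instrument classes
full_set = TensorDataset(torch.randn(100, 128), torch.randint(0, 21, (100,)))

# 80/20 train/test split, as described for the baseline comparison.
n_train = int(0.8 * len(full_set))
train_set, test_set = random_split(full_set, [n_train, len(full_set) - n_train])
loader = DataLoader(train_set, batch_size=12, shuffle=True)  # batch size 12, as quoted

# Pre-training: Adam with an initial learning rate of 1e-5.
optimizer = optim.Adam(model.parameters(), lr=1e-5)
criterion = nn.CrossEntropyLoss()                            # placeholder objective for the sketch

for epoch in range(2):                                       # paper reports ~6k epochs; shortened here
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()

# Fine-tuning: same hyper-parameters, but the learning rate drops to 1e-7.
for group in optimizer.param_groups:
    group["lr"] = 1e-7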