TriBERT: Human-centric Audio-visual Representation Learning
Authors: Tanzila Rahman, Mengyu Yang, Leonid Sigal
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We pre-train our model on the large MUSIC21 dataset and demonstrate improved performance in audio-visual sound source separation on that dataset as well as other datasets through fine-tuning. In addition, we show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks such as cross-modal audio-visual-pose retrieval by as much as 66.7% in top-1 accuracy. |
| Researcher Affiliation | Collaboration | 1 University of British Columbia, 2 University of Toronto, 3 Vector Institute for AI, 4 Canada CIFAR AI Chair |
| Pseudocode | No | The paper describes the architecture and methods in prose and figures, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We will make our code and pretrained models publicly available upon acceptance of the paper. https://github.com/ubc-vision/TriBERT |
| Open Datasets | Yes | We pre-train our model on the large MUSIC21 dataset and demonstrate improved performance in audio-visual sound source separation on that dataset as well as other datasets through fine-tuning. We use the MUSIC21 dataset [56] to train our network on two pretraining tasks: classification and sound source separation. For fine-tuning, we use the MUSIC dataset [55], which is a subset of MUSIC21, containing 685 untrimmed videos of musical solos and duets from 11 instrument classes. |
| Dataset Splits | Yes | We consider the MUSIC21 dataset [56], which contains 1365 untrimmed videos of musical solos and duets from 21 instrument classes for the initial training of our Tri BERT architecture. For fine-tuning, we use the MUSIC dataset [55], which is a subset of MUSIC21, containing 685 untrimmed videos of musical solos and duets from 11 instrument classes. As a result, for fair comparison, we trained our baselines [15, 55] with the available videos using an 80/20 train/test split. For the MUSIC dataset, we follow the experimental protocol from [42] and consider their reported results as our baselines. |
| Hardware Specification | Yes | We use the Adam optimizer with an initial learning rate of 1e-5 and batch size of 12 to train the network on 4 GTX 1080 GPUs for 6k epochs. |
| Software Dependencies | No | We used PyTorch to implement our network. The paper mentions PyTorch but does not specify its version number or any other software dependencies with versions. |
| Experiment Setup | Yes | We use the Adam optimizer with an initial learning rate of 1e-5 and batch size of 12 to train the network on 4 GTX 1080 GPUs for 6k epochs. Training takes approximately 192 hours. We follow a fine-tuning strategy where we modify the classification layer from each pre-trained stream and then train our proposed model end-to-end with a learning rate of 1e-7 for 1500 epochs while keeping the rest of the hyper-parameters the same as the initial task. |
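
To make the hyper-parameters quoted in the Hardware Specification and Experiment Setup rows concrete, here is a minimal PyTorch sketch of the reported schedule (Adam, initial learning rate 1e-5, batch size 12 for pre-training; learning rate 1e-7 for fine-tuning). The model and data below are hypothetical placeholders rather than the authors' released TriBERT code, and the epoch counts are shortened so the sketch runs quickly.

```python
# Minimal sketch of the reported training schedule; not the authors' implementation.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for the TriBERT network; the real model takes
# vision, pose, and audio inputs rather than a single feature vector.
model = nn.Linear(512, 21)  # e.g. 21 instrument classes in MUSIC21

# Dummy features and labels standing in for the actual dataset.
features = torch.randn(120, 512)
labels = torch.randint(0, 21, (120,))
loader = DataLoader(TensorDataset(features, labels), batch_size=12, shuffle=True)

criterion = nn.CrossEntropyLoss()

def run_epochs(model, loader, lr, num_epochs):
    """Train with the Adam settings reported in the paper."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(num_epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()

# Pre-training: initial learning rate 1e-5 (paper: 6k epochs on 4 GTX 1080 GPUs;
# shortened here for illustration).
run_epochs(model, loader, lr=1e-5, num_epochs=2)

# Fine-tuning: learning rate 1e-7 (paper: 1500 epochs), other
# hyper-parameters unchanged.
run_epochs(model, loader, lr=1e-7, num_epochs=1)
```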