Masked Autoencoders with Multi-Window Local-Global Attention Are Better Audio Learners
Authors: Sarthak Yadav, Sergios Theodoridis, Lars Kai Hansen, Zheng-Hua Tan
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results on ten downstream audio tasks show that MW-MAEs consistently outperform standard MAEs in overall performance and learn better general-purpose audio representations, along with demonstrating considerably better scaling characteristics. Investigating attention distances and entropies reveals that MW-MAE encoders learn heads with broader local and global attention. Analyzing attention head feature representations through Projection Weighted Canonical Correlation Analysis (PWCCA) shows that attention heads with the same window sizes across the decoder layers of the MW-MAE learn correlated feature representations, which enables each block to independently capture local and global information, leading to a decoupled decoder feature hierarchy. |
| Researcher Affiliation | Academia | Sarthak Yadav (1,2), Sergios Theodoridis (1,4), Lars Kai Hansen (2,3), Zheng-Hua Tan (1,2); affiliations: 1 Aalborg University, 2 Pioneer Centre for Artificial Intelligence, Denmark, 3 Technical University of Denmark, 4 University of Athens |
| Pseudocode | Yes | Figure 5: Pseudocode for WinAttention (Appendix A). A hedged NumPy sketch of the multi-window attention idea is given after this table. |
| Open Source Code | Yes | The code for feature extraction and running downstream experiments for our default configurations as well as the corresponding pre-trained weights can be found at https://github.com/SarthakYadav/mwmae-jax-official. |
| Open Datasets | Yes | We use the full AudioSet dataset (Gemmeke et al., 2017) (AS-5k) for pre-training MAEs and MW-MAEs. With over 5000 hours of audio data distributed in 2 million 10-second weakly annotated YouTube clips spanning 527 classes, AudioSet is one of the largest publicly available audio corpora. |
| Dataset Splits | Yes | For evaluating audio representations, we utilize a subset of the HEAR benchmark which consists of ten diverse tasks spanning multiple domains. ... For downstream evaluation, we follow the HEAR protocol, where for each task, a shallow downstream classifier is trained on top of fixed features extracted using a pretrained model. ... MAESTRO-5h is evaluated in a 5-fold cross-validation setting, with the Onset FMS evaluation metric. A simplified shallow-probe sketch of this protocol follows the table. |
| Hardware Specification | Yes | All MAEs are pre-trained for 100 epochs with a batch size of 1024 and a weight decay of 0.05 on a single TPU-v3 VM with 8 TPU cores... Accelerators: 8x TPU-v3 cores; 1x Nvidia A40 |
| Software Dependencies | No | The paper mentions software tools such as 'torchaudio', 'hear-eval-kit', and the 'SpeechBrain toolkit', but it does not specify their version numbers. |
| Experiment Setup | Yes | Pre-training: ...We use a high masking ratio (80%)...Our default configuration consists of a ViT-B encoder...For our default configuration, our patch embedding computes non-overlapping patches with a patch size of (4 × 16)...a smaller 4-layer deep transformer-based decoder of width 384 and 8 attention heads for our default configuration. ...pre-trained for 100 epochs with a batch size of 1024 and a weight decay of 0.05...warm up for ten epochs to a base learning rate of 1e-5, followed by a cosine decay schedule. A masking ratio of 0.8 with unstructured random masking is used... These values are collected into a configuration sketch after this table. |
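
The paper's own pseudocode (Figure 5) is not reproduced here. Below is a minimal NumPy sketch of the multi-window local-global attention idea: each decoder head attends within its own window around the query position, and a head whose window covers the whole sequence attends globally. Function and variable names are illustrative assumptions, not the authors' implementation (the official code linked above is JAX-based).

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_window_attention(q, k, v, window_sizes):
    """Illustrative sketch: head h attends only to keys within window_sizes[h]
    of the query position; a window covering the whole sequence makes that
    head effectively global."""
    heads, seq_len, head_dim = q.shape
    out = np.empty_like(v)
    pos = np.arange(seq_len)
    for h, w in enumerate(window_sizes):
        scores = q[h] @ k[h].T / np.sqrt(head_dim)              # (seq_len, seq_len)
        band = np.abs(pos[:, None] - pos[None, :]) <= w // 2    # local band mask
        scores = np.where(band, scores, -np.inf)                # mask out-of-window keys
        out[h] = softmax(scores) @ v[h]
    return out

# Toy usage: 4 heads with window sizes growing from local to effectively global.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(4, 16, 8)) for _ in range(3))
print(multi_window_attention(q, k, v, window_sizes=[2, 4, 8, 32]).shape)  # (4, 16, 8)
```

Assigning different window sizes to different heads is what lets a single decoder block mix local and global context, which the paper connects to the decoupled decoder feature hierarchy observed via PWCCA.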
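
The pre-training hyperparameters quoted in the Experiment Setup row can be gathered into a small configuration sketch. The warmup-plus-cosine schedule below is one plausible reading of "warm up for ten epochs to a base learning rate of 1e-5, followed by a cosine decay schedule"; details such as the decay floor are assumptions and may differ from the authors' setup.

```python
import math

# Default pre-training configuration as stated in the paper.
CONFIG = {
    "encoder": "ViT-B",
    "patch_size": (4, 16),        # non-overlapping spectrogram patches
    "masking_ratio": 0.8,         # unstructured random masking
    "decoder_depth": 4,
    "decoder_width": 384,
    "decoder_heads": 8,
    "epochs": 100,
    "batch_size": 1024,
    "weight_decay": 0.05,
    "base_lr": 1e-5,
    "warmup_epochs": 10,
}

def lr_at_epoch(epoch, cfg=CONFIG):
    """Linear warmup to the base learning rate, then cosine decay to zero (assumed floor)."""
    if epoch < cfg["warmup_epochs"]:
        return cfg["base_lr"] * (epoch + 1) / cfg["warmup_epochs"]
    progress = (epoch - cfg["warmup_epochs"]) / (cfg["epochs"] - cfg["warmup_epochs"])
    return 0.5 * cfg["base_lr"] * (1.0 + math.cos(math.pi * progress))

print(lr_at_epoch(9), lr_at_epoch(55))  # peak at the end of warmup, then decaying
```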
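
The HEAR evaluation protocol quoted in the Dataset Splits row (a shallow classifier trained on fixed features from a frozen pre-trained model) can be illustrated with the simplified probe below. The features and labels are synthetic stand-ins, the classifier choice is an assumption, and the sketch omits task-specific metrics such as Onset FMS; it is not the hear-eval-kit pipeline.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# Stand-in for fixed embeddings from the frozen pre-trained encoder;
# in the real protocol these are extracted per clip, here they are random.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 768))      # (n_clips, embed_dim)
labels = rng.integers(0, 10, size=200)      # (n_clips,)

# HEAR-style shallow probe: a small MLP trained on the frozen features,
# scored here with 5-fold cross-validation (as reported for MAESTRO-5h).
probe = MLPClassifier(hidden_layer_sizes=(512,), max_iter=50)
print(cross_val_score(probe, features, labels, cv=5).mean())
```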