LEAF: A Learnable Frontend for Audio Classification
Authors: Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, Marco Tagliasacchi
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our frontend on three supervised learning problems: i) single-task classification; ii) multi-task classification and iii) multi-label classification on Audioset. Table 1 reports the results for each task, with 95% confidence intervals representing the uncertainty due to the limited test sample size. |
| Researcher Affiliation | Industry | Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry & Marco Tagliasacchi, Google Research, {neilz, oliviert, fcq, mtagliasacchi}@google.com |
| Pseudocode | No | The paper provides mathematical equations (e.g., Equation 1 for optimization, Equations 2-6 for component operations) and describes the architecture of LEAF. However, it does not include any sections, figures, or blocks explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Our code is publicly available: https://github.com/google-research/leaf-audio |
| Open Datasets | Yes | We train independent single-task supervised models on 8 distinct classification problems: acoustic scene classification on TUT (Heittola et al., 2018), birdsong detection (Stowell et al., 2018), emotion recognition on Crema-D (Cao et al., 2014), speaker identification on VoxCeleb (Nagrani et al., 2017), musical instrument and pitch detection on NSynth (Engel et al., 2017), keyword spotting on Speech Commands (Warden, 2018), and language identification on VoxForge (Revay & Tesch, 2019). A summary of the datasets used in our experiments is illustrated in Table A.1. |
| Dataset Splits | Yes | Table A.1: Datasets used in the experiments. Default train/test splits are always adopted. ... The input signal sampled at Fs = 16 kHz is passed through the frontend which feeds into the convolutional encoder. ... To address the variable length of the input sequences, we train on randomly sampled 1-second windows. We train with ADAM (Kingma & Ba, 2014) and a learning rate of 10⁻⁴ for 1M batches, with batch size 256. For Audioset experiments, we train with mixup (Zhang et al., 2017) and SpecAugment (Park et al., 2019). |
| Hardware Specification | No | The paper mentions the use of 'EfficientNet-B0' and 'CNN14' as convolutional encoders and specifies certain training parameters (e.g., batch size). However, it does not provide any specific details about the hardware used for these experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions using 'ADAM (Kingma & Ba, 2014)' as the optimizer, and 'mixup (Zhang et al., 2017)' and 'SpecAugment (Park et al., 2019)' for Audioset experiments. However, it does not specify version numbers for these or any other software dependencies (e.g., Python, TensorFlow, PyTorch, CUDA versions) that would be needed for reproducibility. |
| Experiment Setup | Yes | The input signal sampled at Fs = 16 kHz is passed through the frontend which feeds into the convolutional encoder. As baseline, we use a log-compressed mel-filterbank with 40 channels, computed over windows of 25 ms with a stride of 10 ms. For a fair comparison, both LEAF and the learnable baselines also have N = 40 filters, each with W = 401 coefficients (25 ms at 16 kHz). The learnable pooling is computed over 401 samples with a stride of 160 samples (10 ms at 16 kHz), giving the same output dimension as mel-filterbanks. ... To address the variable length of the input sequences, we train on randomly sampled 1-second windows. We train with ADAM (Kingma & Ba, 2014) and a learning rate of 10⁻⁴ for 1M batches, with batch size 256. (Minimal sketches of this frontend and training configuration follow the table.) |
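To make the baseline frontend settings above concrete, the following is a minimal sketch (not the authors' code) of a log-compressed mel-filterbank with the stated parameters: 40 channels, 25 ms windows and a 10 ms stride at 16 kHz. It uses librosa's mel-spectrogram as a stand-in; the FFT size and log offset are assumptions, and note that the paper's learnable filters use 401 coefficients rather than a 400-sample STFT window.

```python
import numpy as np
import librosa

# Sketch of the log-mel baseline frontend described in the Experiment Setup row:
# 40 mel channels, 25 ms windows (400 samples at 16 kHz), 10 ms stride (160 samples),
# followed by log compression. Parameter names follow librosa, not the LEAF codebase.
SAMPLE_RATE = 16_000
N_MELS = 40
WIN_LENGTH = 400   # 25 ms at 16 kHz
HOP_LENGTH = 160   # 10 ms at 16 kHz

def log_mel_baseline(waveform: np.ndarray) -> np.ndarray:
    """Compute a log-compressed mel-filterbank with the paper's stated settings."""
    mel = librosa.feature.melspectrogram(
        y=waveform,
        sr=SAMPLE_RATE,
        n_fft=512,            # assumption: next power of two above the window length
        win_length=WIN_LENGTH,
        hop_length=HOP_LENGTH,
        n_mels=N_MELS,
    )
    return np.log(mel + 1e-6)  # small offset to avoid log(0)

# Example: one second of audio yields ~100 frames of 40 mel channels.
frames = log_mel_baseline(np.random.randn(SAMPLE_RATE).astype(np.float32))
print(frames.shape)  # (40, 101)
```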
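Similarly, the training details quoted in the Dataset Splits and Experiment Setup rows (randomly sampled 1-second windows, ADAM with a learning rate of 10⁻⁴, batch size 256) can be sketched as below. This is an illustrative configuration, assuming TensorFlow as in the released repository; the crop helper and constant names are hypothetical, and the model, loss, and data pipeline are omitted.

```python
import tensorflow as tf

# Illustrative training settings matching the hyperparameters quoted above;
# this is not the authors' code.
SAMPLE_RATE = 16_000
WINDOW = SAMPLE_RATE      # randomly sampled 1-second training windows
BATCH_SIZE = 256
LEARNING_RATE = 1e-4

def random_one_second_crop(waveform: tf.Tensor) -> tf.Tensor:
    """Sample a random 1 s window from a variable-length mono waveform."""
    # Zero-pad clips shorter than 1 s so a full window can always be taken.
    pad_amount = tf.maximum(0, WINDOW - tf.shape(waveform)[0])
    padded = tf.concat([waveform, tf.zeros([pad_amount], dtype=waveform.dtype)], axis=0)
    max_start = tf.shape(padded)[0] - WINDOW + 1
    start = tf.random.uniform([], 0, max_start, dtype=tf.int32)
    return padded[start:start + WINDOW]

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)

# Example: a 3 s clip is reduced to a random 1 s training window.
crop = random_one_second_crop(tf.random.normal([3 * SAMPLE_RATE]))
print(crop.shape)  # (16000,)
```

In a full run, the crop function would be mapped over the dataset pipeline, batched at 256, and optimized for 1M steps, with mixup and SpecAugment added for the Audioset experiments.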