LEAF: A Learnable Frontend for Audio Classification
Authors: Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, Marco Tagliasacchi
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our frontend on three supervised learning problems: i) single-task classification; ii) multi-task classification and iii) multi-label classification on Audioset. Table 1 reports the results for each task, with 95% confidence intervals representing the uncertainty due to the limited test sample size. |
| Researcher Affiliation | Industry | Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry & Marco Tagliasacchi, Google Research, {neilz, oliviert, fcq, mtagliasacchi}@google.com |
| Pseudocode | No | The paper provides mathematical equations (e.g., Equation 1 for optimization, Equations 2-6 for component operations) and describes the architecture of LEAF. However, it does not include any sections, figures, or blocks explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Our code is publicly available: https://github.com/google-research/leaf-audio |
| Open Datasets | Yes | We train independent single-task supervised models on 8 distinct classification problems: acoustic scene classification on TUT (Heittola et al., 2018), birdsong detection (Stowell et al., 2018), emotion recognition on Crema-D (Cao et al., 2014), speaker identification on VoxCeleb (Nagrani et al., 2017), musical instrument and pitch detection on NSynth (Engel et al., 2017), keyword spotting on Speech Commands (Warden, 2018), and language identification on VoxForge (Revay & Tesch, 2019). A summary of the datasets used in our experiments is illustrated in Table A.1. |
| Dataset Splits | Yes | Table A.1: Datasets used in the experiments. Default train/test splits are always adopted. ... The input signal sampled at Fs = 16 kHz is passed through the frontend which feeds into the convolutional encoder. ... To address the variable length of the input sequences, we train on randomly sampled 1-second windows. We train with ADAM (Kingma & Ba, 2014) and a learning rate of 10⁻⁴ for 1M batches, with batch size 256. For Audioset experiments, we train with mixup (Zhang et al., 2017) and SpecAugment (Park et al., 2019). |
| Hardware Specification | No | The paper mentions the use of 'EfficientNet-B0' and 'CNN14' as convolutional encoders and specifies certain training parameters (e.g., batch size). However, it does not provide any specific details about the hardware used for these experiments, such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper mentions using 'ADAM (Kingma & Ba, 2014)' as the optimizer, and 'mixup (Zhang et al., 2017)' and 'SpecAugment (Park et al., 2019)' for Audioset experiments. However, it does not specify version numbers for these or any other software dependencies (e.g., Python, TensorFlow, PyTorch, CUDA versions) that would be needed for reproducibility. |
| Experiment Setup | Yes | The input signal sampled at Fs = 16 kHz is passed through the frontend which feeds into the convolutional encoder. As baseline, we use a log-compressed mel-filterbank with 40 channels, computed over windows of 25 ms with a stride of 10 ms. For a fair comparison, both LEAF and the learnable baselines also have N = 40 filters, each with W = 401 coefficients (25 ms at 16 kHz). The learnable pooling is computed over 401 samples with a stride of 160 samples (10 ms at 16 kHz), giving the same output dimension as mel-filterbanks. ... To address the variable length of the input sequences, we train on randomly sampled 1-second windows. We train with ADAM (Kingma & Ba, 2014) and a learning rate of 10⁻⁴ for 1M batches, with batch size 256. (Minimal sketches of this frontend and training configuration follow the table.) |
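To make the baseline frontend settings above concrete, the following is a minimal sketch (not the authors' code) of a log-compressed mel-filterbank with the stated parameters: 40 channels, 25 ms windows and a 10 ms stride at 16 kHz. It uses librosa's mel-spectrogram as a stand-in; the FFT size and log offset are assumptions, and note that the paper's learnable filters use 401 coefficients rather than a 400-sample STFT window.

```python
import numpy as np
import librosa

# Sketch of the log-mel baseline frontend described in the Experiment Setup row:
# 40 mel channels, 25 ms windows (400 samples at 16 kHz), 10 ms stride (160 samples),
# followed by log compression. Parameter names follow librosa, not the LEAF codebase.
SAMPLE_RATE = 16_000
N_MELS = 40
WIN_LENGTH = 400   # 25 ms at 16 kHz
HOP_LENGTH = 160   # 10 ms at 16 kHz

def log_mel_baseline(waveform: np.ndarray) -> np.ndarray:
    """Compute a log-compressed mel-filterbank with the paper's stated settings."""
    mel = librosa.feature.melspectrogram(
        y=waveform,
        sr=SAMPLE_RATE,
        n_fft=512,            # assumption: next power of two above the window length
        win_length=WIN_LENGTH,
        hop_length=HOP_LENGTH,
        n_mels=N_MELS,
    )
    return np.log(mel + 1e-6)  # small offset to avoid log(0)

# Example: one second of audio yields ~100 frames of 40 mel channels.
frames = log_mel_baseline(np.random.randn(SAMPLE_RATE).astype(np.float32))
print(frames.shape)  # (40, 101)
```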
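Similarly, the training details quoted in the Dataset Splits and Experiment Setup rows (randomly sampled 1-second windows, ADAM with a learning rate of 10⁻⁴, batch size 256) can be sketched as below. This is an illustrative configuration, assuming TensorFlow as in the released repository; the crop helper and constant names are hypothetical, and the model, loss, and data pipeline are omitted.

```python
import tensorflow as tf

# Illustrative training settings matching the hyperparameters quoted above;
# this is not the authors' code.
SAMPLE_RATE = 16_000
WINDOW = SAMPLE_RATE      # randomly sampled 1-second training windows
BATCH_SIZE = 256
LEARNING_RATE = 1e-4

def random_one_second_crop(waveform: tf.Tensor) -> tf.Tensor:
    """Sample a random 1 s window from a variable-length mono waveform."""
    # Zero-pad clips shorter than 1 s so a full window can always be taken.
    pad_amount = tf.maximum(0, WINDOW - tf.shape(waveform)[0])
    padded = tf.concat([waveform, tf.zeros([pad_amount], dtype=waveform.dtype)], axis=0)
    max_start = tf.shape(padded)[0] - WINDOW + 1
    start = tf.random.uniform([], 0, max_start, dtype=tf.int32)
    return padded[start:start + WINDOW]

optimizer = tf.keras.optimizers.Adam(learning_rate=LEARNING_RATE)

# Example: a 3 s clip is reduced to a random 1 s training window.
crop = random_one_second_crop(tf.random.normal([3 * SAMPLE_RATE]))
print(crop.shape)  # (16000,)
```

In a full run, the crop function would be mapped over the dataset pipeline, batched at 256, and optimized for 1M steps, with mixup and SpecAugment added for the Audioset experiments.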