Learning Temporal Resolution in Spectrogram for Audio Classification

Authors: Haohe Liu, Xubo Liu, Qiuqiang Kong, Wenwu Wang, Mark D. Plumbley

AAAI 2024

Research Type: Experimental
LLM Response: We evaluate DiffRes on five audio classification tasks, using mel-spectrograms as the acoustic features, followed by off-the-shelf classifier backbones. Compared with previous methods using a fixed temporal resolution, the DiffRes-based method can achieve equivalent or better classification accuracy with at least a 25% reduction in computational cost.

Researcher Affiliation: Academia
LLM Response: Haohe Liu¹, Xubo Liu¹, Qiuqiang Kong², Wenwu Wang¹, Mark D. Plumbley¹ (¹University of Surrey, ²The Chinese University of Hong Kong)

Pseudocode: No
LLM Response: The paper describes the algorithms and methods in textual form and through equations, but does not include structured pseudocode or algorithm blocks.

Open Source Code: Yes
LLM Response: Our code is available at https://github.com/haoheliu/diffres-python.

Open Datasets: Yes
LLM Response: We evaluate DiffRes on five different tasks and datasets, including audio tagging on AudioSet (Gemmeke et al. 2017) and FSD50K (Fonseca et al. 2021), environmental sound classification on ESC50 (Piczak 2015), limited-vocabulary speech recognition on Speech Commands (Warden 2018), and music instrument classification on NSynth (Engel et al. 2017).

Dataset Splits: Yes
LLM Response: Following the evaluation protocol of previous works (Zeghidour et al. 2021; Riad et al. 2021; Kong et al. 2020; Gong, Chung, and Glass 2021b), we report mean average precision (mAP) as the main evaluation metric on AudioSet and FSD50K, and report classification accuracy (ACC) on the other datasets. For the training data, we apply random spec-augmentation (Park et al. 2019) and mixup augmentation (Zhang et al. 2017) following Gong, Chung, and Glass (2021b). All experiments are repeated three times with different seeds to reduce randomness.

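To make the quoted training recipe concrete, here is a minimal PyTorch/torchaudio sketch of the two augmentations named above; the mask widths and the mixup Beta parameter are illustrative assumptions, not values reported in the paper.

```python
import torch
import torchaudio

# SpecAugment-style masking (Park et al. 2019) on (batch, n_mels, time) mel-spectrograms.
freq_mask = torchaudio.transforms.FrequencyMasking(freq_mask_param=24)  # illustrative width
time_mask = torchaudio.transforms.TimeMasking(time_mask_param=96)       # illustrative width

def augment(mel: torch.Tensor, labels: torch.Tensor, alpha: float = 10.0):
    """Apply spec-augmentation, then mixup (Zhang et al. 2017) across the batch."""
    mel = time_mask(freq_mask(mel))
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(mel.size(0))
    mixed_mel = lam * mel + (1.0 - lam) * mel[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]  # soft multi-label targets
    return mixed_mel, mixed_labels
```

For the metrics, the mAP reported on AudioSet and FSD50K is conventionally the macro average of per-class average precision, e.g. sklearn.metrics.average_precision_score(y_true, y_score, average="macro").
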
Hardware Specification: Yes
LLM Response: The computation time is measured between inputting the waveform and outputting the label prediction (with EfficientNet-B2). We use 128 filters in LEAF (Zeghidour et al. 2021) for a fair comparison with the 128 mel-filterbanks in Mel and DiffRes. As shown in Figure 4, our proposed DiffRes only introduces a marginal computational cost compared with Mel. The state-of-the-art learnable front-end, LEAF, is about four times slower than our proposed method. The majority of the computational cost in LEAF comes from multiple complex-valued convolutions, which are computed in the time domain with large kernels (e.g., 400) and a stride of one. Evaluated on a 2.6 GHz Intel Core i7 CPU.

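A waveform-to-prediction timing of this kind can be reproduced with a simple wall-clock harness such as the sketch below; full_pipeline is a hypothetical stand-in for the front-end plus EfficientNet-B2 classifier, and the 16 kHz sample rate and 10-second clip length are assumptions.

```python
import time
import torch

@torch.no_grad()
def measure_latency(model, waveform: torch.Tensor, n_runs: int = 20) -> float:
    """Mean wall-clock seconds from raw waveform input to label prediction."""
    model.eval()
    model(waveform)  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(n_runs):
        model(waveform)
    return (time.perf_counter() - start) / n_runs

# Hypothetical usage: a 10-second clip at an assumed 16 kHz rate, on CPU.
# latency = measure_latency(full_pipeline, torch.randn(1, 160000))
```
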
Software Dependencies: No
LLM Response: The paper mentions software components like "EfficientNet-B2" and mel-spectrogram calculation parameters (e.g., "Hanning window"), but it does not provide specific version numbers for any programming languages, libraries, or frameworks used (e.g., Python, PyTorch, TensorFlow).

Experiment Setup: Yes
LLM Response: In all experiments, we use the same architecture as used by Gong, Chung, and Glass (2021b), which is an EfficientNet-B2 (Tan and Le 2019) with four attention heads (13.6 M parameters). We reload the ImageNet-pretrained weights for EfficientNet-B2 in a similar way to Gong, Chung, and Glass (2021a,b). For the training data, we apply random spec-augmentation (Park et al. 2019) and mixup augmentation (Zhang et al. 2017) following Gong, Chung, and Glass (2021b). We train the DiffRes layer with λ = 0.5 and ϵ = 1×10⁻⁴. We calculate the mel-spectrogram with a Hanning window, 25 ms window length, 10 ms hop size, and 128 mel-filterbanks by default.

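For reference, this mel-spectrogram configuration maps directly onto torchaudio as in the sketch below; the 16 kHz sample rate is an assumption (the quoted setup gives the window and hop in milliseconds, not the rate), and DiffRes itself would then operate on these frames (see the diffres-python repository above).

```python
import torch
import torchaudio

SAMPLE_RATE = 16000  # assumed; the setup specifies 25 ms / 10 ms, not the rate

mel_frontend = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=int(0.025 * SAMPLE_RATE),       # 25 ms window -> 400 samples
    win_length=int(0.025 * SAMPLE_RATE),
    hop_length=int(0.010 * SAMPLE_RATE),  # 10 ms hop -> 160 samples
    n_mels=128,
    window_fn=torch.hann_window,          # Hanning window
)

waveform = torch.randn(1, 10 * SAMPLE_RATE)          # dummy 10-second clip
log_mel = torch.log(mel_frontend(waveform) + 1e-10)  # shape (1, 128, 1001)
```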