DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification
Authors: Tony Alex, Sara Ahmed, Armin Mustafa, Muhammad Awais, Philip JB Jackson
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed DTF-AT architecture is rigorously evaluated across diverse audio and speech classification tasks, consistently establishing new benchmarks for state-of-the-art (SOTA) performance. In this section, we evaluate the proposed architecture in various benchmark audio datasets, followed by ablation experiments to assess the various choices made during network development. |
| Researcher Affiliation | Academia | Tony Alex1, Sara Ahmed1, Armin Mustafa1, Muhammad Awais1, Philip JB Jackson1,2 1Surrey Institute for People-Centred AI, University of Surrey, Guildford, GU2 7XH, UK 2Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | Yes | The codebase and pretrained weights are available on https://github.com/ta012/DTFAT.git |
| Open Datasets | Yes | Dataset. Audio Set (Gemmeke et al. 2017) consists of audio files downloaded from YouTube. There are 3 subsets with 527 labels commonly used in experiments, namely the full set (~2M audio files), balanced set (~20k), and evaluation set (~20k). ESC50 (Piczak 2015) is a collection of 2,000 5-second audio files with 50 classes. The Speech Commands V2 (Warden 2018) consists of 84,843 audio files in the train set, 9,981 files in the validation set, and 11,005 files in the evaluation set, each containing a spoken word of duration 1 second and spanning 35 classes. |
| Dataset Splits | Yes | For ESC50, the data is divided into 5 folds. The model is then trained five times, each time using a different fold as the evaluation set and the remaining four folds as the training set. The Speech Commands V2 consists of 84,843 audio files in the train set, 9,981 files in the validation set, and 11,005 files in the evaluation set. (A sketch of the ESC50 fold protocol appears after the table.) |
| Hardware Specification | Yes | For Audio Set experiments, we employed a single Nvidia A100-80GB GPU, while for ESC50 and Speech Commands V2, we utilised one NVIDIA GeForce RTX 3090-24GB GPU, running on the Ubuntu OS and employing the PyTorch deep learning framework. |
| Software Dependencies | No | The paper mentions "running on the Ubuntu OS and employing the PyTorch deep learning framework" but does not provide specific version numbers for PyTorch or other key software dependencies. |
| Experiment Setup | Yes | We converted audio files of 10 seconds duration and 32 kHz sample rate to 128-dimensional Mel filterbank (fbank) features, resulting in an input shape of 1024 × 128. ... We used an initial learning rate of 5e-4 for both the full set and the balanced set. The learning rate is updated by multiplying with a factor of 0.5 for the full set and 0.1 for the balanced set at certain validation steps using a multi-step learning rate scheduler. The models are trained with the AdamW optimiser (Loshchilov and Hutter 2019) and a binary cross-entropy loss function. ... For the balanced set, the models are trained for 50 epochs using a batch size of 32. As for the full Audio Set dataset, the models are trained with a batch size of 64 for 10 and 8 epochs in the aforementioned settings, respectively. (A sketch of this training configuration appears after the table.) |
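
The ESC50 split reported above is a standard 5-fold cross-validation over the folds shipped with the dataset's metadata. Below is a minimal sketch of that protocol, assuming the stock `esc50.csv` metadata file and a hypothetical `train_and_evaluate` placeholder in place of the real DTF-AT training loop; neither is taken from the released codebase.

```python
# Minimal sketch of the ESC-50 5-fold protocol quoted in the table.
# The metadata path and `train_and_evaluate` helper are assumptions, not
# details from the DTF-AT repository.
import pandas as pd

def train_and_evaluate(train_files, eval_files):
    """Placeholder: train on `train_files`, return accuracy on `eval_files`."""
    return 0.0  # the real loop would train DTF-AT and report evaluation accuracy

meta = pd.read_csv("ESC-50-master/meta/esc50.csv")  # standard ESC-50 metadata file (path assumed)

fold_accuracies = []
for eval_fold in range(1, 6):  # ESC-50 metadata numbers its folds 1..5
    train_files = meta.loc[meta["fold"] != eval_fold, "filename"].tolist()
    eval_files = meta.loc[meta["fold"] == eval_fold, "filename"].tolist()
    fold_accuracies.append(train_and_evaluate(train_files, eval_files))

print("Mean accuracy over 5 folds:", sum(fold_accuracies) / len(fold_accuracies))
```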
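The experiment setup row combines 128-bin Mel filterbank inputs, the AdamW optimiser, a multi-step learning-rate schedule, and a binary cross-entropy loss. The PyTorch sketch below shows how such a configuration could be wired together; the placeholder model, the milestone epochs, and the torchaudio feature call are assumptions rather than details confirmed by the paper or its codebase.

```python
# Hedged sketch of the quoted training configuration (AdamW, multi-step LR,
# binary cross-entropy, 128-bin fbank features from 10 s audio at 32 kHz).
import torch
import torchaudio

# --- Input features: 128-dimensional Mel filterbank frames ---
waveform, sr = torchaudio.load("example.wav")  # hypothetical audio file
fbank = torchaudio.compliance.kaldi.fbank(
    waveform, num_mel_bins=128, sample_frequency=sr
)  # -> (num_frames, 128); padded/cropped to 1024 frames to match the 1024 x 128 input

# --- Optimiser, scheduler, and loss as described for the Audio Set full-set run ---
model = torch.nn.Linear(128, 527)  # placeholder standing in for DTF-AT (527 AudioSet labels)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[3, 5, 7], gamma=0.5
)  # milestones are illustrative; gamma=0.5 matches the full-set decay factor
criterion = torch.nn.BCEWithLogitsLoss()  # binary cross-entropy over the 527 labels

for epoch in range(10):  # full set: 10 epochs with batch size 64
    # ... mini-batch training with `optimizer.step()` would go here ...
    scheduler.step()
```

For the balanced-set setting described in the same row, the decay factor would drop to 0.1 and training would run for 50 epochs with a batch size of 32.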