Pengi: An Audio Language Model for Audio Tasks

Authors: Soham Deshmukh, Benjamin Elizalde, Rita Singh, Huaming Wang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When evaluated on 21 downstream tasks, our approach yields state-of-the-art performance in several of them. Our results show that connecting language models with audio models is a major step towards general-purpose audio understanding.
Researcher Affiliation | Collaboration | Microsoft; Carnegie Mellon University. {sdeshmukh, benjaminm, huawang}@microsoft.com, rsingh@cs.cmu.edu
Pseudocode | No | The paper describes the architecture and training process in text and diagrams, but does not include any explicit pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available here: https://github.com/microsoft/Pengi
Open Datasets | Yes | The training data is collected from multiple audio datasets coming from different sources. In all, we collected 3.4 million audio-text pairs and mapped them to the 8 templates. The number of training pairs makes this model one of the largest, if not the largest, non-speech audio models in the literature. We use only the training set of each dataset. The datasets and their mapping to a task are the following. Sound Event Classification: AudioSet [21], FSD50K [20]; Acoustic Scene Classification: CochlScene [27]; Speech Emotion and Sentiment Recognition: MSP-Podcast [38], CMU-MOSI [60], CMU-MOSEI [61], MELD [46]; Music Analysis: NSynth [17], FMA [9]; Audio Captioning: AudioCaps [30], Clotho V2 [13]; Audio Question Answering: ClothoAQA [37]; Auxiliary: WavText5K [11], SoundDescs [33], MACS [40], WavCaps [41], FreeSound [18] and FindSound. (A sketch of this template mapping appears after the table.)
Dataset Splits | Yes | Example row from the downstream-task table: Domain: Audio Captioning; Dataset: Clotho; Files: 7k; Dur.: 15-30 secs; Output Type: Cap.; Metric: SPIDEr; Setup: train/val/test.
Hardware Specification | Yes | We used Adam Optimiser [32] for 60 epochs and with a batch size of 384 on 20 V100 GPUs.
Software Dependencies | No | The paper mentions specific models and optimizers used (e.g., GPT2-base, Adam Optimiser, HTSAT, CLIP's text encoder, CLAP), but does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | We used Adam Optimiser [32] for 60 epochs and with a batch size of 384 on 20 V100 GPUs. We used a linear schedule with 2000 warmup steps and a base learning rate of 1e-4. (A minimal configuration sketch follows below.)
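
To make the dataset-to-template mapping quoted in the Open Datasets row concrete, the following is a minimal sketch of turning labelled audio examples into (audio, input text, output text) triples, matching the paper's audio-text-to-text framing. The template wording, domain keys, dataset annotations, and file paths here are illustrative assumptions, not the paper's exact eight templates.

```python
# Hypothetical sketch of mapping dataset-specific annotations onto task templates.
# Templates and field names are illustrative, not the paper's exact eight templates.
from dataclasses import dataclass


@dataclass
class PengiExample:
    audio_path: str   # path to the raw waveform file
    input_text: str   # task prompt paired with the audio
    output_text: str  # target text the causal language model must generate


# Illustrative task templates keyed by domain: (input prompt, output pattern).
TEMPLATES = {
    "sound_event_classification": ("this is a sound of", "{labels}"),
    "audio_captioning":           ("generate audio caption", "{caption}"),
    "audio_question_answering":   ("question: {question}", "{answer}"),
}


def map_to_template(domain: str, audio_path: str, **fields) -> PengiExample:
    """Convert one dataset-specific annotation into a templated training triple."""
    input_tmpl, output_tmpl = TEMPLATES[domain]
    return PengiExample(
        audio_path=audio_path,
        input_text=input_tmpl.format(**fields),
        output_text=output_tmpl.format(**fields),
    )


# Example: a sound-event clip with two labels (hypothetical file path).
example = map_to_template(
    "sound_event_classification",
    "fsd50k/clip_0001.wav",
    labels="dog bark, traffic noise",
)
print(example.input_text, "->", example.output_text)
```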
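
The Experiment Setup row reports Adam, 60 epochs, a batch size of 384, a linear schedule with 2000 warmup steps, and a base learning rate of 1e-4. Below is a minimal PyTorch sketch of that schedule under stated assumptions: the model is a placeholder, and the step count is a rough estimate derived from the reported 3.4 million training pairs rather than a value given in the paper.

```python
# Minimal sketch of the reported optimisation setup: Adam, 60 epochs, batch size 384,
# linear schedule with 2000 warmup steps, base learning rate 1e-4.
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(512, 512)  # placeholder, not the actual Pengi model

epochs, batch_size, warmup_steps, base_lr = 60, 384, 2000, 1e-4
steps_per_epoch = 3_400_000 // batch_size  # rough estimate from 3.4M audio-text pairs
total_steps = epochs * steps_per_epoch

optimizer = Adam(model.parameters(), lr=base_lr)


def linear_warmup_decay(step: int) -> float:
    """Linear warmup to the base LR over warmup_steps, then linear decay to zero."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))


scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_decay)

# Inside the training loop, after each optimiser step:
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```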