Pengi: An Audio Language Model for Audio Tasks
Authors: Soham Deshmukh, Benjamin Elizalde, Rita Singh, Huaming Wang
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When evaluated on 21 downstream tasks, our approach yields state-of-the-art performance in several of them. Our results show that connecting language models with audio models is a major step towards general-purpose audio understanding. |
| Researcher Affiliation | Collaboration | Microsoft and Carnegie Mellon University; {sdeshmukh, benjaminm, huawang}@microsoft.com, rsingh@cs.cmu.edu |
| Pseudocode | No | The paper describes the architecture and training process in text and diagrams, but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available here: https://github.com/microsoft/Pengi |
| Open Datasets | Yes | The training data is collected from multiple audio datasets coming from different sources. In all, we collected 3.4 million audio-text pairs and mapped them to the 8 templates. The number of training pairs makes this model one of the largest, if not the largest, non-speech audio model in the literature. We use only the training set of each dataset. The datasets and their mapping to a task are the following. Sound Event Classification: AudioSet [21], FSD50K [20]; Acoustic Scene Classification: CochlScene [27]; Speech Emotion and Sentiment Recognition: MSP-Podcast [38], CMU-MOSI [60], CMU-MOSEI [61], MELD [46]; Music Analysis: NSynth [17], FMA [9]; Audio Captioning: AudioCaps [30], Clotho v2 [13]; Audio Question Answering: ClothoAQA [37]; Auxiliary: WavText5K [11], SoundDescs [33], MACS [40], WavCaps [41], FreeSound [18] and FindSound. (An illustrative template-mapping sketch follows this table.) |
| Dataset Splits | Yes | Example row from the downstream-task table (Domain: Audio Captioning; Dataset: Clotho; Files: 7k; Dur. (secs): 15-30; Output Type: Cap.; Metric: SPIDEr; Setup: train/val/test). |
| Hardware Specification | Yes | We used Adam Optimiser [32] for 60 epochs with a batch size of 384 on 20 V100 GPUs. |
| Software Dependencies | No | The paper mentions specific models and optimizers used (e.g., GPT2-base, Adam Optimiser, HTSAT, CLIP's text encoder, CLAP), but does not specify version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | We used Adam Optimiser [32] for 60 epochs with a batch size of 384 on 20 V100 GPUs. We used a linear schedule with 2000 warmup steps and a base learning rate of 1e-4. |
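
The Open Datasets row notes that 3.4 million audio-text pairs were mapped to 8 input templates. The templates themselves are not reproduced in this report, so the snippet below is only a minimal sketch of how a labeled clip from a classification dataset could be rewritten as an (audio, input text, output text) training triple; the template strings, field names, and helper function are illustrative assumptions, not the paper's actual prompts or code.

```python
# Hypothetical sketch: turning a labeled audio clip into an audio-text
# training pair via a task template. The template wording below is an
# illustrative assumption, not Pengi's actual prompt set.
from dataclasses import dataclass

@dataclass
class AudioTextPair:
    audio_path: str    # path to the waveform
    input_text: str    # task prompt fed to the text encoder
    output_text: str   # free-form text the decoder must generate

# Assumed template table keyed by task type (the paper uses 8 templates;
# only two hypothetical entries are shown here).
TEMPLATES = {
    "sound_event_classification": "this is a sound of",
    "audio_captioning": "generate audio caption",
}

def make_pair(audio_path: str, task: str, target_text: str) -> AudioTextPair:
    """Map one dataset example to the (audio, input text, output text) format."""
    return AudioTextPair(
        audio_path=audio_path,
        input_text=TEMPLATES[task],
        output_text=target_text,
    )

# Example: an AudioSet clip labeled "dog bark" becomes a text-generation target.
pair = make_pair("clip_0001.wav", "sound_event_classification", "dog bark")
print(pair.input_text, "->", pair.output_text)
```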
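
The Experiment Setup row reports Adam, 60 epochs, a batch size of 384, a linear schedule with 2000 warmup steps, and a base learning rate of 1e-4. The sketch below is one way to wire those reported hyperparameters together in PyTorch; the placeholder model and the total-step estimate (derived from the reported 3.4 million pairs and batch size) are assumptions, not values taken from the paper or the released code.

```python
# Hedged sketch of the reported training schedule in PyTorch: Adam with a
# linear warmup (2000 steps) followed by linear decay from a 1e-4 base LR.
# The model and total step count are placeholders, not from the released code.
import torch

model = torch.nn.Linear(512, 512)   # placeholder for the trainable mapping networks
base_lr = 1e-4
warmup_steps = 2000

# Rough step estimate: ~3.4M pairs / batch size 384 ~= 8.9k steps per epoch,
# times 60 epochs ~= 531k steps (an assumption for illustration only).
total_steps = (3_400_000 // 384) * 60

optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

def linear_warmup_then_decay(step: int) -> float:
    """Scale factor on base_lr: ramp up for warmup_steps, then decay linearly to 0."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return remaining / max(1, total_steps - warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=linear_warmup_then_decay)

# Inside the training loop, step the scheduler once per optimizer update:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```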