Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation
Authors: Manh Luong, Khai Nguyen, Nhat Ho, Gholamreza Haffari, Dinh Phung, Lizhen Qu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50. Results demonstrate that our proposed method is capable of learning a rich and expressive joint embedding space, which achieves SOTA performance. |
| Researcher Affiliation | Academia | Manh Luong¹, Khai Nguyen², Nhat Ho², Dinh Phung¹, Gholamreza Haffari¹, Lizhen Qu¹ (¹Monash University, Australia; ²University of Texas at Austin, USA) |
| Pseudocode | Yes | We refer the reader to the training algorithm in Algorithm 1 in Appendix A.2. Algorithm 1: Learning the ground cost metric using the m-LTM framework and the Mahalanobis distance. Algorithm 2: Sinkhorn Algorithm. A minimal sketch of both steps appears after this table. |
| Open Source Code | Yes | Our code is available at https://github.com/v-manhlt3/m-LTM-Audio-Text-Retrieval |
| Open Datasets | Yes | We conduct extensive experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50. AudioCaps (Kim et al., 2019), Clotho (Drossos et al., 2019), ESC-50 (Piczak, 2015). |
| Dataset Splits | Yes | AudioCaps is the largest audio captioning dataset, consisting of around 50k audio-caption pairs. There are a total of 40,582 audio clips in the training data, and all audio clips are 10 seconds long. Each training clip has a single human-annotated caption. The validation and test sets have 494 and 957 audio clips, respectively, and each audio clip has five ground-truth captions. |
| Hardware Specification | Yes | All experiments are performed on a single A100 GPU. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer (Kingma & Ba, 2014)', a 'ResNet-38 model (Kong et al., 2019)', the 'HTSAT audio encoder (Chen et al., 2022)', and a 'BERT model (Devlin et al., 2018)'. While these refer to specific algorithms and models, no software library names with version numbers (e.g., 'PyTorch 1.9', 'TensorFlow 2.x') are provided, so the exact dependencies cannot be pinned for reproducibility. |
| Experiment Setup | Yes | All the models and the matrix M are trained for 30 epochs with the Adam optimizer (Kingma & Ba, 2014). The hyperparameters for training are learning rate lr = 1×10⁻⁴, batch size b = 256, and dropout ratio p = 0.2. See the configuration sketch after this table. |
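
To make the two algorithms listed under Pseudocode concrete, the following is a minimal sketch of an entropic optimal-transport step with a learnable Mahalanobis ground cost. It is written against plain PyTorch as an assumption (the paper does not pin a framework); the function names, the factorization M = LLᵀ, and the uniform marginals are illustrative choices, not the authors' exact implementation (see Algorithms 1 and 2 in Appendix A.2 and the linked repository for that).

```python
import torch

def mahalanobis_cost(X, Y, L):
    """Pairwise Mahalanobis-style ground cost.

    X: (n, d) audio embeddings; Y: (m, d) text embeddings.
    L: (d, d) learnable factor so that M = L @ L.T stays PSD.
    Returns an (n, m) cost matrix with C[i, j] = ||L.T (x_i - y_j)||^2.
    """
    Xp = X @ L                          # project audio embeddings
    Yp = Y @ L                          # project text embeddings
    return torch.cdist(Xp, Yp, p=2) ** 2

def sinkhorn(C, eps=0.05, n_iters=100):
    """Entropic OT plan via Sinkhorn iterations, uniform marginals assumed."""
    n, m = C.shape
    a = torch.full((n,), 1.0 / n, device=C.device)
    b = torch.full((m,), 1.0 / m, device=C.device)
    K = torch.exp(-C / eps)             # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)             # column scaling
        u = a / (K @ v)                 # row scaling
    return u[:, None] * K * v[None, :]  # transport plan P
```

A training step could then minimize a transport-based loss such as `(sinkhorn(C) * C).sum()` with respect to both the encoder parameters and `L`; the paper's actual m-LTM objective is given in the text and in Algorithm 1.

Similarly, the reported Experiment Setup maps onto a short configuration block. Everything here except the four reported hyperparameters (30 epochs, lr = 1e-4, batch size 256, dropout 0.2) is a placeholder assumption; in particular, the projection heads stand in for the paper's ResNet-38/HTSAT audio encoders and BERT text encoder.

```python
import torch
import torch.nn as nn

# Reported hyperparameters: 30 epochs, lr = 1e-4, batch size 256, dropout 0.2.
EPOCHS, LR, BATCH_SIZE, DROPOUT = 30, 1e-4, 256, 0.2

# Placeholder projection heads; the input/output dimensions are illustrative
# assumptions, not the paper's encoder architectures.
audio_head = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Dropout(p=DROPOUT))
text_head = nn.Sequential(nn.Linear(768, 512), nn.ReLU(), nn.Dropout(p=DROPOUT))

# The Mahalanobis factor L from the sketch above is trained jointly with the
# encoders, matching the paper's "all the models and the matrix M are trained".
L = nn.Parameter(torch.eye(512))

optimizer = torch.optim.Adam(
    list(audio_head.parameters()) + list(text_head.parameters()) + [L],
    lr=LR,
)
```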
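The single shared optimizer over both encoders and `L` mirrors the joint-training setup the paper reports; a single A100 GPU, as noted under Hardware Specification, is sufficient for a batch size of 256 at these embedding dimensions.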