Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation
Authors: Manh Luong, Khai Nguyen, Nhat Ho, Gholamreza Haffari, Dinh Phung, Lizhen Qu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50. Results demonstrate that our proposed method is capable of learning a rich and expressive joint embedding space, which achieves SOTA performance. |
| Researcher Affiliation | Academia | Manh Luong¹, Khai Nguyen², Nhat Ho², Dinh Phung¹, Gholamreza Haffari¹, Lizhen Qu¹ (¹Monash University, Australia; ²University of Texas at Austin, USA) |
| Pseudocode | Yes | We refer the reader to the training algorithm in Algorithm 1 in Appendix A.2. Algorithm 1: Learning the ground cost metric using the m-LTM framework and the Mahalanobis distance. Algorithm 2: Sinkhorn Algorithm. A minimal sketch of both steps appears after this table. |
| Open Source Code | Yes | Our code is available at https://github.com/v-manhlt3/m-LTM-Audio-Text-Retrieval |
| Open Datasets | Yes | We conduct extensive experiments on audio-text matching problems using three datasets: AudioCaps, Clotho, and ESC-50. AudioCaps (Kim et al., 2019), Clotho (Drossos et al., 2019), ESC-50 (Piczak, 2015). |
| Dataset Splits | Yes | AudioCaps is the largest audio captioning dataset, consisting of around 50k audio-caption pairs. There are a total of 40,582 audio clips in the training data, and all audio clips are 10 seconds long. Each training clip has a single human-annotated caption. The validation and test sets have 494 and 957 audio clips, respectively, and each audio clip has five ground-truth captions. |
| Hardware Specification | Yes | All experiments are performed on a single A100 GPU. |
| Software Dependencies | No | The paper mentions using the 'Adam optimizer (Kingma & Ba, 2014)', a 'ResNet-38 model (Kong et al., 2019)', the 'HTSAT audio encoder (Chen et al., 2022)', and a 'BERT model (Devlin et al., 2018)'. While these refer to specific algorithms and models, no software library names with version numbers (e.g., 'PyTorch 1.9', 'TensorFlow 2.x') are provided, so the exact dependencies cannot be pinned for reproducibility. |
| Experiment Setup | Yes | All the models and the matrix M are trained for 30 epochs with the Adam optimizer (Kingma & Ba, 2014). The hyperparameters for training are learning rate lr = 1×10⁻⁴, batch size b = 256, and dropout ratio p = 0.2. See the configuration sketch after this table. |
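
To make the two algorithms listed under Pseudocode concrete, the following is a minimal sketch of an entropic optimal-transport step with a learnable Mahalanobis ground cost. It is written against plain PyTorch as an assumption (the paper does not pin a framework); the function names, the factorization M = LLᵀ, and the uniform marginals are illustrative choices, not the authors' exact implementation (see Algorithms 1 and 2 in Appendix A.2 and the linked repository for that).

```python
import torch

def mahalanobis_cost(X, Y, L):
    """Pairwise Mahalanobis-style ground cost.

    X: (n, d) audio embeddings; Y: (m, d) text embeddings.
    L: (d, d) learnable factor so that M = L @ L.T stays PSD.
    Returns an (n, m) cost matrix with C[i, j] = ||L.T (x_i - y_j)||^2.
    """
    Xp = X @ L                          # project audio embeddings
    Yp = Y @ L                          # project text embeddings
    return torch.cdist(Xp, Yp, p=2) ** 2

def sinkhorn(C, eps=0.05, n_iters=100):
    """Entropic OT plan via Sinkhorn iterations, uniform marginals assumed."""
    n, m = C.shape
    a = torch.full((n,), 1.0 / n, device=C.device)
    b = torch.full((m,), 1.0 / m, device=C.device)
    K = torch.exp(-C / eps)             # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)             # column scaling
        u = a / (K @ v)                 # row scaling
    return u[:, None] * K * v[None, :]  # transport plan P
```

A training step could then minimize a transport-based loss such as `(sinkhorn(C) * C).sum()` with respect to both the encoder parameters and `L`; the paper's actual m-LTM objective is given in the text and in Algorithm 1.

Similarly, the reported Experiment Setup maps onto a short configuration block. Everything here except the four reported hyperparameters (30 epochs, lr = 1e-4, batch size 256, dropout 0.2) is a placeholder assumption; in particular, the projection heads stand in for the paper's ResNet-38/HTSAT audio encoders and BERT text encoder.

```python
import torch
import torch.nn as nn

# Reported hyperparameters: 30 epochs, lr = 1e-4, batch size 256, dropout 0.2.
EPOCHS, LR, BATCH_SIZE, DROPOUT = 30, 1e-4, 256, 0.2

# Placeholder projection heads; the input/output dimensions are illustrative
# assumptions, not the paper's encoder architectures.
audio_head = nn.Sequential(nn.Linear(2048, 512), nn.ReLU(), nn.Dropout(p=DROPOUT))
text_head = nn.Sequential(nn.Linear(768, 512), nn.ReLU(), nn.Dropout(p=DROPOUT))

# The Mahalanobis factor L from the sketch above is trained jointly with the
# encoders, matching the paper's "all the models and the matrix M are trained".
L = nn.Parameter(torch.eye(512))

optimizer = torch.optim.Adam(
    list(audio_head.parameters()) + list(text_head.parameters()) + [L],
    lr=LR,
)
```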
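The single shared optimizer over both encoders and `L` mirrors the joint-training setup the paper reports; a single A100 GPU, as noted under Hardware Specification, is sufficient for a batch size of 256 at these embedding dimensions.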