Listen, Think, and Understand
Authors: Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, James R. Glass
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTS |
| Researcher Affiliation | Collaboration | MIT CSAIL, MIT-IBM Watson AI Lab |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code, dataset, and pretrained models are available at https://github.com/yuangongnd/ltu. |
| Open Datasets | Yes | we relabel existing public datasets including Audio Set (including a 500K subset of the original 2M weakly-labeled release (Gemmeke et al., 2017) and the 100K subset with temporally-strong labels (Hershey et al., 2021)), VGGSound (Chen et al., 2020a), FSD50K (Fonseca et al., 2021), AudioCaps (Kim et al., 2019), Freesound (Font et al., 2013), Clotho v2 (Lipping et al., 2019), and Sound Bible (soundbible.com, 2006) as our training data. |
| Dataset Splits | Yes | For all these datasets, we only include data marked as training and validation samples and exclude any data marked as test or evaluation. In the evaluation set, each of the 15 acoustic scenes has 108 segments and the total number of evaluation samples is 1,620. |
| Hardware Specification | Yes | The model is trained on 4 RTX A6000 GPUs for about 3 days. |
| Software Dependencies | No | The paper mentions software components like 'LLaMA-7B', 'Vicuna', 'GPT-3.5-Turbo', and 'GPT-4', but does not provide specific version numbers for these or other libraries used for reproducibility. |
| Experiment Setup | Yes | In all training stages, we use a batch size of 256 and linear learning rate decay with warmup. We set the text token cutoff length to 108. Throughout this paper, we use a plain generation setting of Temperature=0.1, Top K=500, and Top P=0.95 with a repetition penalty of 1.1 (Fan et al., 2018; Keskar et al., 2019). |
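The Experiment Setup row quotes fixed decoding hyperparameters. As a rough illustration, the sketch below maps those settings onto Hugging Face transformers generation arguments; the checkpoint path and prompt are hypothetical placeholders, and the paper does not state which library implements its decoding.

```python
# Minimal sketch (not the authors' code) of the decoding settings quoted
# above: Temperature=0.1, Top K=500, Top P=0.95, repetition penalty 1.1.
# Assumes a Hugging Face transformers causal LM.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "path/to/ltu-llama-7b"  # hypothetical; see the GitHub repo above
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "Question: What can be heard in this audio clip?"  # illustrative only
inputs = tokenizer(prompt, return_tensors="pt")

output_ids = model.generate(
    **inputs,
    do_sample=True,           # sampling is required for temperature/top-k/top-p
    temperature=0.1,
    top_k=500,
    top_p=0.95,
    repetition_penalty=1.1,   # Keskar et al., 2019
    max_new_tokens=108,       # illustrative cap; 108 is the paper's training-time text cutoff
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Note that top-k and top-p here constrain the same sampling step: tokens outside the 500 most probable are discarded first, then the nucleus (cumulative probability 0.95) filter is applied before sampling at temperature 0.1.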