Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Listen, Think, and Understand
Authors: Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, James R. Glass
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTS |
| Researcher Affiliation | Collaboration | MIT CSAIL¹, MIT-IBM Watson AI Lab² |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code, dataset, and pretrained models are available at https://github.com/yuangongnd/ltu. |
| Open Datasets | Yes | we relabel existing public datasets including AudioSet (including a 500K subset of the original 2M weakly-labeled release (Gemmeke et al., 2017) and the 100K subset with temporally-strong labels (Hershey et al., 2021)), VGGSound (Chen et al., 2020a), FSD50K (Fonseca et al., 2021), AudioCaps (Kim et al., 2019), Freesound (Font et al., 2013), Clotho v2 (Lipping et al., 2019), and SoundBible (soundbible.com, 2006) as our training data. |
| Dataset Splits | Yes | For all these datasets, we only include data marked as training and validation samples and exclude any data marked as test or evaluation. In the evaluation set, each of the 15 acoustic scenes has 108 segments and the total number of evaluation samples is 1,620. |
| Hardware Specification | Yes | The model is trained on 4 RTX A6000 GPUs for about 3 days. |
| Software Dependencies | No | The paper mentions software components like 'LLaMA-7B', 'Vicuna', 'GPT-3.5-Turbo', and 'GPT-4', but does not provide specific version numbers for these or other libraries used for reproducibility. |
| Experiment Setup | Yes | In all training stages, we use a batch size of 256 and linear learning rate decay with warmup. We set the text token cutoff length to 108. Throughout this paper, we use a plain generation setting of Temperature=0.1, Top-K=500, and Top-P=0.95 with a repetition penalty of 1.1 (Fan et al., 2018; Keskar et al., 2019). |
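The decoding settings quoted in the Experiment Setup row (Temperature=0.1, Top-K=500, Top-P=0.95, repetition penalty 1.1) follow a standard sampling pipeline. Below is a minimal pure-Python sketch of that pipeline for illustration; it is a generic implementation of these common decoding heuristics, not the authors' code, and the function name and toy vocabulary are assumptions.

```python
import math

def filter_logits(logits, temperature=0.1, top_k=500, top_p=0.95,
                  repetition_penalty=1.1, prev_tokens=()):
    """Turn raw logits into a sampling distribution using the quoted settings.

    Returns {token_id: probability} over the tokens that survive
    top-k and nucleus (top-p) filtering.
    """
    logits = list(logits)
    # Repetition penalty (Keskar et al., 2019): damp logits of already-seen tokens.
    for t in set(prev_tokens):
        if logits[t] > 0:
            logits[t] /= repetition_penalty
        else:
            logits[t] *= repetition_penalty
    # Temperature scaling: T < 1 sharpens the distribution toward the top token.
    scaled = [l / temperature for l in logits]
    # Top-K: keep only the K highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    keep = order[:top_k]
    # Numerically stable softmax over the kept tokens.
    m = max(scaled[i] for i in keep)
    exps = {i: math.exp(scaled[i] - m) for i in keep}
    z = sum(exps.values())
    probs = {i: e / z for i, e in exps.items()}
    # Top-P (nucleus): keep the smallest set of tokens whose mass reaches top_p.
    nucleus, mass = {}, 0.0
    for i in sorted(probs, key=probs.get, reverse=True):
        nucleus[i] = probs[i]
        mass += probs[i]
        if mass >= top_p:
            break
    # Renormalize the nucleus into a proper distribution.
    z = sum(nucleus.values())
    return {i: p / z for i, p in nucleus.items()}
```

With a low temperature such as 0.1, nearly all probability mass collapses onto the highest-scoring token, so the nucleus often contains a single token and sampling is close to greedy decoding.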