wav2tok: Deep Sequence Tokenizer for Audio Retrieval
Authors: Adhiraj Banerjee, Vipul Arora
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show the consistent performance of wav2tok across two audio retrieval tasks: music search (query by humming) and speech search via audio query, outperforming state-of-the-art baselines. |
| Researcher Affiliation | Academia | Adhiraj Banerjee, Vipul Arora Department of Electrical Engineering Indian Institute of Technology Kanpur, India {adhiraj,vipular}@iitk.ac.in |
| Pseudocode | No | The paper describes the model architecture and training process using text and mathematical equations, but it does not contain structured pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | The codes are available in https://github.com/madhavlab/wav2tok. |
| Open Datasets | Yes | We use the TIMIT dataset which is composed of 6300 utterances of English speech with time-aligned word transcriptions. (Garofolo et al., 1993) |
| Dataset Splits | Yes | The training dataset has 1970 hums for training and 676 for validation. |
| Hardware Specification | No | The paper does not provide specific hardware details (like GPU or CPU models, or cloud instance types with specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using the "sklearn library", the "fuzzysearch library", and the "ADAM optimizer", but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We train wav2tok using the ADAM (Kingma & Ba, 2017) optimizer and a linear learning schedule with a learning rate of 0.001 and 8% of the training steps as warm-up steps. The STFT matrices are computed with 513 frequency bins, a window length of 1024 samples (summarising 128 ms of audio), and hop length of 512 samples. |
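The quoted STFT settings can be sanity-checked with a short sketch. This is a hypothetical illustration, not the authors' code: the window function (Hann) is an assumption, and the 8 kHz sampling rate is inferred from the paper's statement that 1024 samples summarise 128 ms of audio.

```python
import numpy as np

# Assumed sampling rate: 1024 samples / 128 ms => 8 kHz (inferred, not stated directly).
SR = 8000
N_FFT = 1024   # window length of 1024 samples -> 1024 // 2 + 1 = 513 frequency bins
HOP = 512      # hop length of 512 samples

def stft_magnitude(signal: np.ndarray) -> np.ndarray:
    """Return a (513, n_frames) magnitude STFT using a Hann window (assumed)."""
    window = np.hanning(N_FFT)
    n_frames = 1 + (len(signal) - N_FFT) // HOP
    frames = np.stack([
        signal[i * HOP: i * HOP + N_FFT] * window
        for i in range(n_frames)
    ])
    # rfft over each windowed frame yields N_FFT // 2 + 1 = 513 bins
    return np.abs(np.fft.rfft(frames, axis=1)).T

# One second of a 440 Hz tone at the assumed 8 kHz rate.
t = np.arange(SR) / SR
mag = stft_magnitude(np.sin(2 * np.pi * 440 * t))
print(mag.shape)  # (513, 14)
```

With these parameters, one second of audio yields 14 fully covered frames and the 513 frequency bins quoted in the paper; successive frames overlap by 50% (hop = window / 2).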