wav2tok: Deep Sequence Tokenizer for Audio Retrieval

Authors: Adhiraj Banerjee, Vipul Arora

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show the consistent performance of wav2tok across two audio retrieval tasks: music search (query by humming) and speech search via audio query, outperforming state-of-the-art baselines.
Researcher Affiliation | Academia | Adhiraj Banerjee, Vipul Arora; Department of Electrical Engineering, Indian Institute of Technology Kanpur, India; {adhiraj,vipular}@iitk.ac.in
Pseudocode | No | The paper describes the model architecture and training process using text and mathematical equations, but it does not contain structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/madhavlab/wav2tok.
Open Datasets | Yes | We use the TIMIT dataset, which is composed of 6300 utterances of English speech with time-aligned word transcriptions (Garofolo et al., 1993).
Dataset Splits | Yes | The dataset has 1970 hums for training and 676 for validation.
Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, or cloud instance types) used for running its experiments.
Software Dependencies | No | The paper mentions using the sklearn library, the fuzzysearch library, and the ADAM optimizer, but does not provide version numbers for these or other software dependencies. (A hedged version-logging sketch follows the table.)
Experiment Setup | Yes | We train wav2tok using the ADAM (Kingma & Ba, 2017) optimizer and a linear learning schedule with a learning rate of 0.001 and 8% of the training steps as warm-up steps. The STFT matrices are computed with 513 frequency bins, a window length of 1024 samples (summarising 128 ms of audio), and a hop length of 512 samples. (A hedged configuration sketch also follows the table.)
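Since the paper names its libraries but not their versions, one pragmatic step when reproducing is to record the versions actually installed. The snippet below is a minimal sketch of that idea; the torch entry is an assumption (the paper does not state its deep learning framework), while scikit-learn and fuzzysearch are the libraries the paper names.

    # Minimal sketch: log installed versions of the libraries the paper names.
    # "torch" is an assumption; the paper names only sklearn and fuzzysearch,
    # and gives no version numbers for either.
    from importlib.metadata import version, PackageNotFoundError

    for pkg in ("scikit-learn", "fuzzysearch", "torch"):
        try:
            print(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            print(f"{pkg}: not installed")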
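The Experiment Setup row pins down the feature extraction and optimization hyperparameters. The sketch below shows how those numbers fit together, assuming PyTorch; the placeholder model, total step count, and exact scheduler shape are assumptions, not the authors' code (which lives in the linked repository). Note that 513 frequency bins correspond to n_fft = 1024 (1024/2 + 1), and a 1024-sample window spanning 128 ms implies an 8 kHz sampling rate.

    # Hedged sketch of the reported setup, assuming PyTorch.
    import torch

    # STFT parameters from the paper: 513 frequency bins => n_fft = 1024,
    # window of 1024 samples (128 ms at 8 kHz), hop of 512 samples.
    N_FFT = 1024        # n_fft // 2 + 1 = 513 frequency bins
    WIN_LENGTH = 1024
    HOP_LENGTH = 512

    def stft_features(wav: torch.Tensor) -> torch.Tensor:
        """Magnitude STFT of a mono waveform; output shape (513, n_frames)."""
        spec = torch.stft(
            wav,
            n_fft=N_FFT,
            hop_length=HOP_LENGTH,
            win_length=WIN_LENGTH,
            window=torch.hann_window(WIN_LENGTH),
            return_complex=True,
        )
        return spec.abs()

    # ADAM with a linear schedule and 8% warm-up, as reported.
    model = torch.nn.Linear(513, 64)  # placeholder; the real encoder is in the repo
    total_steps = 10_000              # assumption: the paper does not state the step count
    warmup_steps = int(0.08 * total_steps)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    def lr_lambda(step: int) -> float:
        # Linear warm-up to the base LR, then linear decay to zero
        # (one common reading of "linear learning schedule").
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)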