wav2tok: Deep Sequence Tokenizer for Audio Retrieval

Authors: Adhiraj Banerjee, Vipul Arora

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show the consistent performance of wav2tok across two audio retrieval tasks: music search (query by humming) and speech search via audio query, outperforming state-of-the-art baselines.
Researcher Affiliation | Academia | Adhiraj Banerjee, Vipul Arora; Department of Electrical Engineering, Indian Institute of Technology Kanpur, India; {adhiraj,vipular}@iitk.ac.in
Pseudocode | No | The paper describes the model architecture and training process using text and mathematical equations, but it does not contain structured pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/madhavlab/wav2tok.
Open Datasets | Yes | We use the TIMIT dataset, which is composed of 6300 utterances of English speech with time-aligned word transcriptions (Garofolo et al., 1993).
Dataset Splits | Yes | The dataset has 1970 hums for training and 676 for validation.
Hardware Specification | No | The paper does not provide specific hardware details (such as GPU or CPU models, or cloud instance types) used for running its experiments.
Software Dependencies | No | The paper mentions using the sklearn library, the fuzzysearch library, and the ADAM optimizer, but does not provide version numbers for these or other software dependencies. (A hedged version-logging sketch follows the table.)
Experiment Setup | Yes | We train wav2tok using the ADAM (Kingma & Ba, 2017) optimizer and a linear learning schedule with a learning rate of 0.001 and 8% of the training steps as warm-up steps. The STFT matrices are computed with 513 frequency bins, a window length of 1024 samples (summarising 128 ms of audio), and a hop length of 512 samples. (A hedged configuration sketch also follows the table.)
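Since the paper names its libraries but not their versions, one pragmatic step when reproducing is to record the versions actually installed. The snippet below is a minimal sketch of that idea; the torch entry is an assumption (the paper does not state its deep learning framework), while scikit-learn and fuzzysearch are the libraries the paper names.

    # Minimal sketch: log installed versions of the libraries the paper names.
    # "torch" is an assumption; the paper names only sklearn and fuzzysearch,
    # and gives no version numbers for either.
    from importlib.metadata import version, PackageNotFoundError

    for pkg in ("scikit-learn", "fuzzysearch", "torch"):
        try:
            print(f"{pkg}=={version(pkg)}")
        except PackageNotFoundError:
            print(f"{pkg}: not installed")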
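The Experiment Setup row pins down the feature extraction and optimization hyperparameters. The sketch below shows how those numbers fit together, assuming PyTorch; the placeholder model, total step count, and exact scheduler shape are assumptions, not the authors' code (which lives in the linked repository). Note that 513 frequency bins correspond to n_fft = 1024 (1024/2 + 1), and a 1024-sample window spanning 128 ms implies an 8 kHz sampling rate.

    # Hedged sketch of the reported setup, assuming PyTorch.
    import torch

    # STFT parameters from the paper: 513 frequency bins => n_fft = 1024,
    # window of 1024 samples (128 ms at 8 kHz), hop of 512 samples.
    N_FFT = 1024        # n_fft // 2 + 1 = 513 frequency bins
    WIN_LENGTH = 1024
    HOP_LENGTH = 512

    def stft_features(wav: torch.Tensor) -> torch.Tensor:
        """Magnitude STFT of a mono waveform; output shape (513, n_frames)."""
        spec = torch.stft(
            wav,
            n_fft=N_FFT,
            hop_length=HOP_LENGTH,
            win_length=WIN_LENGTH,
            window=torch.hann_window(WIN_LENGTH),
            return_complex=True,
        )
        return spec.abs()

    # ADAM with a linear schedule and 8% warm-up, as reported.
    model = torch.nn.Linear(513, 64)  # placeholder; the real encoder is in the repo
    total_steps = 10_000              # assumption: the paper does not state the step count
    warmup_steps = int(0.08 * total_steps)

    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    def lr_lambda(step: int) -> float:
        # Linear warm-up to the base LR, then linear decay to zero
        # (one common reading of "linear learning schedule").
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)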