Unsupervised Speech Decomposition via Triple Information Bottleneck

Authors: Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson, David Cox

ICML 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we will empirically verify the disentanglement capability of SPEECHFLOW. We will be visualizing our speech results using spectrogram and pitch contour. However, to fully appreciate the performance of SPEECHFLOW, we strongly encourage readers to refer to our online demo. Additional experiment results can be found in Appendix C. The frequency axis units of all the spectrograms are in kHz, and those of the pitch contour plots are in Hz. (Section 5.1, Configurations:) The experiments are performed on the VCTK dataset (Veaux et al., 2016). The training set contains 20 speakers, where each speaker has about 15 minutes of speech. The test set contains the same 20 speakers but with different utterances, which is the conventional voice conversion setting. SPEECHFLOW is trained using the ADAM optimizer (Kingma & Ba, 2014) with a batch size of 16 for 800k steps.
Researcher Affiliation | Collaboration | MIT-IBM Watson AI Lab, USA; University of Illinois at Urbana-Champaign, USA. Correspondence to: Yang Zhang <yang.zhang2@ibm.com>, Kaizhi Qian <kaizhiqian@gmail.com>.
Pseudocode | No | The paper describes the network architecture in detail and includes diagrams, but it does not provide pseudocode or an algorithm block.
Open Source Code | Yes | Our code is publicly available at https://github.com/auspicious3000/SpeechSplit.
Open Datasets | Yes | The experiments are performed on the VCTK dataset (Veaux et al., 2016). The training set contains 20 speakers, where each speaker has about 15 minutes of speech.
Dataset Splits | No | The paper mentions that "The model selection is performed on the training dataset," but it does not specify explicit training/validation splits (e.g., percentages or sample counts) for reproducibility.
Hardware Specification | No | The paper does not specify the hardware used (e.g., GPU models, CPU types, or memory) for running the experiments.
Software Dependencies | No | The paper mentions using the ADAM optimizer and a WaveNet vocoder but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | SPEECHFLOW is trained using the ADAM optimizer (Kingma & Ba, 2014) with a batch size of 16 for 800k steps. ... Table 1. Hyperparameter settings of the encoders:

                     Rhythm   Content   Pitch
  Conv Layers        1        3         3
  Conv Dim           128      512       256
  Norm Groups        8        32        16
  BLSTM Layers       1        2         1
  BLSTM Dim          1        8         32
  Downsample Factor  8        8         8
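For quick reference, the quoted training settings and Table 1 can be collected into a configuration dictionary. This is only a sketch: the key names are illustrative and do not come from the released SpeechSplit code.

```python
# Hyperparameters quoted from the paper (Section 5.1 and Table 1).
# Key names are illustrative, not taken from the released code.
TRAINING = {
    "optimizer": "Adam",   # Kingma & Ba, 2014
    "batch_size": 16,
    "steps": 800_000,
}

# Per-encoder settings from Table 1 (Rhythm, Content, Pitch columns).
ENCODERS = {
    "conv_layers":       {"rhythm": 1,   "content": 3,   "pitch": 3},
    "conv_dim":          {"rhythm": 128, "content": 512, "pitch": 256},
    "norm_groups":       {"rhythm": 8,   "content": 32,  "pitch": 16},
    "blstm_layers":      {"rhythm": 1,   "content": 2,   "pitch": 1},
    "blstm_dim":         {"rhythm": 1,   "content": 8,   "pitch": 32},
    "downsample_factor": {"rhythm": 8,   "content": 8,   "pitch": 8},
}

def encoder_config(name):
    """Gather the hyperparameters for one encoder ("rhythm", "content", or "pitch")."""
    return {key: per_encoder[name] for key, per_encoder in ENCODERS.items()}

if __name__ == "__main__":
    print(encoder_config("content"))
```

A structure like this makes the paper's reported settings easy to diff against whatever defaults a reimplementation actually uses.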