Unsupervised Speech Decomposition via Triple Information Bottleneck
Authors: Kaizhi Qian, Yang Zhang, Shiyu Chang, Mark Hasegawa-Johnson, David Cox
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we will empirically verify the disentanglement capability of SPEECHFLOW. We will be visualizing our speech results using spectrogram and pitch contour. However, to fully appreciate the performance of SPEECHFLOW, we strongly encourage readers to refer to our online demo. Additional experiment results can be found in appendix C. The frequency axis units of all the spectrograms are in kHz, and those of the pitch contour plots are in Hz. 5.1. Configurations: The experiments are performed on the VCTK dataset (Veaux et al., 2016). The training set contains 20 speakers where each speaker has about 15 minutes of speech. The test set contains the same 20 speakers but with different utterances, which is the conventional voice conversion setting. SPEECHFLOW is trained using the ADAM optimizer (Kingma & Ba, 2014) with a batch size of 16 for 800k steps. (A minimal sketch of this training setup follows the table.) |
| Researcher Affiliation | Collaboration | ¹MIT-IBM Watson AI Lab, USA; ²University of Illinois at Urbana-Champaign, USA. Correspondence to: Yang Zhang <yang.zhang2@ibm.com>, Kaizhi Qian <kaizhiqian@gmail.com>. |
| Pseudocode | No | The paper describes the network architecture in detail and includes diagrams, but it does not provide pseudocode or an algorithm block. |
| Open Source Code | Yes | Our code is publicly available at https://github.com/auspicious3000/SpeechSplit. |
| Open Datasets | Yes | The experiments are performed on the VCTK dataset (Veaux et al., 2016). The training set contains 20 speakers where each speaker has about 15 minutes of speech. |
| Dataset Splits | No | The paper mentions that "The model selection is performed on the training dataset," but it does not specify explicit training/validation splits (e.g., percentages or sample counts) for reproducibility. |
| Hardware Specification | No | The paper does not specify the hardware used (e.g., GPU models, CPU types, or memory) for running the experiments. |
| Software Dependencies | No | The paper mentions using the ADAM optimizer and a WaveNet vocoder but does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | SPEECHFLOW is trained using the ADAM optimizer (Kingma & Ba, 2014) with a batch size of 16 for 800k steps. ... Table 1. Hyperparameter settings of the encoders (Rhythm / Content / Pitch): Conv Layers 1 / 3 / 3; Conv Dim 128 / 512 / 256; Norm Groups 8 / 32 / 16; BLSTM Layers 1 / 2 / 1; BLSTM Dim 1 / 8 / 32; Downsample Factor 8 / 8 / 8. (These values are transcribed into a configuration sketch after this table.) |
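The only training details the paper states are the optimizer (ADAM), the batch size (16), and the step count (800k). Below is a minimal PyTorch sketch of that setup. The learning rate, the loss function, and the `nn.Linear` stand-in model are assumptions for illustration only; the actual SPEECHFLOW architecture is described in the paper and implemented at https://github.com/auspicious3000/SpeechSplit.

```python
import torch
from torch import nn

# Placeholder module standing in for SPEECHFLOW; NOT the paper's architecture.
model = nn.Linear(80, 80)

# Stated in the paper: ADAM optimizer, batch size 16, 800k steps.
# The learning rate is NOT given in this section; 1e-4 is an assumption.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
batch_size = 16
num_steps = 800_000

for step in range(num_steps):
    # Dummy batch of 80-bin spectrogram frames; real training would draw
    # (spectrogram, pitch contour, speaker identity) batches from VCTK.
    x = torch.randn(batch_size, 80)
    recon = model(x)
    loss = nn.functional.mse_loss(recon, x)  # reconstruction-style loss (assumed)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```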
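For reproducibility, the Table 1 encoder hyperparameters can also be written down as a configuration mapping. The values below are transcribed directly from the paper's table; the key names are illustrative and do not follow the repository's actual configuration schema.

```python
# Encoder hyperparameters from Table 1 of the paper
# (Rhythm / Content / Pitch encoders). Key names are illustrative only.
ENCODER_HPARAMS = {
    "rhythm":  {"conv_layers": 1, "conv_dim": 128, "norm_groups": 8,
                "blstm_layers": 1, "blstm_dim": 1,  "downsample_factor": 8},
    "content": {"conv_layers": 3, "conv_dim": 512, "norm_groups": 32,
                "blstm_layers": 2, "blstm_dim": 8,  "downsample_factor": 8},
    "pitch":   {"conv_layers": 3, "conv_dim": 256, "norm_groups": 16,
                "blstm_layers": 1, "blstm_dim": 32, "downsample_factor": 8},
}
```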