Style Equalization: Unsupervised Learning of Controllable Generative Sequence Models
Authors: Jen-Hao Rick Chang, Ashish Shrivastava, Hema Koppula, Xiaoshuai Zhang, Oncel Tuzel
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply style equalization to text-to-speech and text-to-handwriting synthesis on three datasets. We conduct thorough evaluation, including both quantitative and qualitative user studies. Our results show that by mitigating the training-inference mismatch with the proposed style equalization, we achieve style replication scores comparable to real data in our user studies. |
| Researcher Affiliation | Collaboration | Jen-Hao Rick Chang¹, Ashish Shrivastava¹, Hema Swetha Koppula¹, Xiaoshuai Zhang², Oncel Tuzel¹ (¹Apple, ²University of California, San Diego). Correspondence to: Jen-Hao Rick Chang <jenhao_chang@apple.com>. |
| Pseudocode | No | The paper describes its model architecture using diagrams and detailed textual descriptions with tables of block formulations (Table 4), but it does not provide any pseudocode or algorithm blocks. |
| Open Source Code | Yes | https://apple.github.io/ml-style-equalization |
| Open Datasets | Yes | The VCTK dataset (Yamagishi et al., 2019) contains 110 speakers and 44 hours of speech, and the LibriTTS dataset (Zen et al., 2019) contains 2,311 speakers and 555 hours of speech in the training set. |
| Dataset Splits | Yes | We split the training set of LibriTTS-all-960 into training, validation, and test sets in a ratio of 85%, 7.5%, and 7.5%, respectively. |
| Hardware Specification | Yes | The handwriting model is trained for 100 epochs on a machine with 8 A100 GPUs, and the training took 36 hours. The VCTK model is trained for 70 epochs on a machine with 8 A100 GPUs, and the training took 12 hours; the LibriTTS model is trained for 25 epochs on a machine with 8 A100 GPUs, and the training took 3 days. During inference, the LibriTTS model generates 380 mel-spectrogram frames (4.4 seconds) per second on a single A100 GPU. |
| Software Dependencies | No | The paper mentions tools such as ESPnet-TTS (Hayashi et al., 2020) and ESPnet (kamo-naoyuki, 2021) for baselines and pretrained models, and the ADAM optimizer (Kingma & Ba, 2015), but it does not specify the software dependencies (e.g., programming languages, libraries, or frameworks) and version numbers needed to reproduce its own experiments. |
| Experiment Setup | Yes | For handwriting, the dimension of all LSTMs is 512. The final linear layer outputs a 122-dimensional vector... mixture of 20 bivariate Gaussian distributions... Bernoulli distribution for pen lifting... Bernoulli distribution for sequence stops. The posterior and prior Gaussian distributions are 256-dimensional. The convolutional network in the style encoder has four layers; all of them use kernel size 3, stride 2, and no padding. Their feature dimensions are 3 -> 32 -> 64 -> 128 -> 256. We use dropout with a rate of 0.1 after each nonlinearity in the convolutional network. The multi-head attention has 4 heads; the dimensions of the query, key, and value vectors are all 256. The dimension of δ (i.e., k) is 128. The input content c is represented as 200-dimensional one-hot vectors of the input characters. The standard deviation of the Gaussian noise is 0.1, and during inference, we reduce the standard deviation of the output distribution to 0.9 of the original one. For speech, all LSTMs have dimension 2048... The posterior and prior Gaussian distributions are 512-dimensional, and the dimension of δ (i.e., k) is 64 for the VCTK dataset and 192 for the LibriTTS dataset. The standard deviation of the added noise is 0.2, and during inference, we reduce the standard deviation of the output distribution to 0.74 of the original one. We use ADAM (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.98, and the learning-rate schedule of Vaswani et al. (2017) with a warm-up period of 4,000 iterations to optimize the objective function. |
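
The Dataset Splits row reports only the 85% / 7.5% / 7.5% ratios for LibriTTS-all-960. A minimal sketch of such a split is given below; the fixed random seed and the choice to split at the utterance level are assumptions, since the report does not state how the partition was drawn.

```python
import random

# Hedged sketch of an 85% / 7.5% / 7.5% split as reported for LibriTTS-all-960.
# Splitting at the utterance level with seed 0 is an assumption; the report
# gives only the ratios.
def split_dataset(items, ratios=(0.85, 0.075, 0.075), seed=0):
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = int(ratios[0] * len(items))
    n_val = int(ratios[1] * len(items))
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```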
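
The 122-dimensional output in the Experiment Setup row is consistent with a mixture of 20 bivariate Gaussians (20 weights + 40 means + 40 scales + 20 correlations = 120) plus one Bernoulli parameter for pen lifting and one for sequence stops. The sketch below shows one plausible way to parse such an output; the parameter ordering and activations are assumptions in the style of Graves-type handwriting models, not the authors' confirmed implementation.

```python
import torch
import torch.nn.functional as F

def parse_output(y: torch.Tensor, n_mix: int = 20):
    """Split a (..., 122)-dimensional output into mixture and Bernoulli parameters.

    Assumed layout: mixture logits, means, log-scales, correlations, pen-lift
    logit, stop logit. 20 + 40 + 40 + 20 + 1 + 1 = 122.
    """
    logit_pi, mu, log_sigma, rho_raw, pen, stop = torch.split(
        y, [n_mix, 2 * n_mix, 2 * n_mix, n_mix, 1, 1], dim=-1)
    return {
        "pi": F.softmax(logit_pi, dim=-1),                           # mixture weights
        "mu": mu.reshape(*y.shape[:-1], n_mix, 2),                   # component means
        "sigma": log_sigma.exp().reshape(*y.shape[:-1], n_mix, 2),   # component scales
        "rho": torch.tanh(rho_raw),                                  # correlations in (-1, 1)
        "p_pen": torch.sigmoid(pen),                                 # pen-lift probability
        "p_stop": torch.sigmoid(stop),                               # sequence-stop probability
    }
```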
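
The style-encoder convolutional network is specified precisely in the Experiment Setup row (four layers, kernel size 3, stride 2, no padding, feature dimensions 3 -> 32 -> 64 -> 128 -> 256, dropout 0.1 after each nonlinearity). A minimal PyTorch sketch follows; treating the strokes as 1-D sequences (`Conv1d`) and using ReLU are assumptions, since the convolution type and nonlinearity are not named in the row.

```python
import torch.nn as nn

# Minimal sketch of the style-encoder convolutional stack described above:
# four layers, kernel size 3, stride 2, no padding, dropout p = 0.1 after
# each nonlinearity. Conv1d and ReLU are assumptions.
def style_encoder_conv(dims=(3, 32, 64, 128, 256), p_drop=0.1) -> nn.Sequential:
    layers = []
    for c_in, c_out in zip(dims[:-1], dims[1:]):
        layers += [
            nn.Conv1d(c_in, c_out, kernel_size=3, stride=2, padding=0),
            nn.ReLU(),
            nn.Dropout(p=p_drop),
        ]
    return nn.Sequential(*layers)

# Example: a batch of 8 stroke sequences with 3 channels and 200 time steps
# maps to a (8, 256, 11) feature tensor:
#   features = style_encoder_conv()(torch.randn(8, 3, 200))
```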
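
The optimizer settings are also fully specified: ADAM with β1 = 0.9, β2 = 0.98 and the Vaswani et al. (2017) warm-up schedule with 4,000 warm-up iterations. A hedged sketch is below; the model dimension used to scale the schedule (`d_model = 512`) is an assumption, as the report does not state which dimension the schedule is tied to.

```python
import torch

def noam_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Vaswani et al. (2017) warm-up schedule: linear ramp, then inverse-sqrt decay."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

model = torch.nn.Linear(256, 256)  # placeholder for the actual generative model
optimizer = torch.optim.Adam(model.parameters(), lr=1.0, betas=(0.9, 0.98))
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda s: noam_lr(s + 1))

# Per training iteration: optimizer.step() followed by scheduler.step().
```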