FT-GAN: Fine-Grained Tune Modeling for Chinese Opera Synthesis
Authors: Meizhen Zheng, Peng Bai, Xiaodong Shi, Xun Zhou, Yiting Yan
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results show that FT-GAN outperforms the strong baselines in SVS on the Gezi Opera synthesis task. Extensive experiments further verify that FT-GAN performs well on synthesis tasks of other operas such as Peking Opera. |
| Researcher Affiliation | Academia | 1Department of Artificial Intelligence, School of Informatics, Xiamen University, Xiamen, China 2Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China {midon, baipeng}@stu.xmu.edu.cn, mandel@xmu.edu.cn, {xzhou, eatingyan}@stu.xmu.edu.cn |
| Pseudocode | Yes | Algorithm 1: Pseudo Pitch Extraction Algorithm. Input: pitch sampling point sequence P = {p_k}_{k=1}^{s}. Parameters: retention ratio coefficient ϵ, sample point threshold T. Output: pitch value p. |
| Open Source Code | Yes | Audio samples, the dataset, and the codes are available at https://zhengmidon.github.io/FTGAN.github.io/. |
| Open Datasets | Yes | In this work, we build a high-quality Gezi Opera (a type of Chinese opera popular in Fujian and Taiwan) audiotext alignment dataset... Audio samples, the dataset, and the codes are available at https://zhengmidon.github.io/FTGAN.github.io/. |
| Dataset Splits | No | The paper does not explicitly state training/test/validation dataset splits (e.g., percentages or sample counts). While it mentions training data and evaluation, the split details are not provided. |
| Hardware Specification | Yes | All training is conducted on one RTX 3090 GPU. |
| Software Dependencies | No | The paper mentions tools used like Montreal Forced Aligner (MFA), Praat, and Parselmouth, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | The audio sampling rate is 24 kHz... The window size of the fast Fourier transform is set to 512, and the hop size is set to 128. The spectrogram after the fast Fourier transform is converted to a mel-spectrogram of 80 bins. All hidden layer sizes are set to 256. The FFT block in the pitch and phoneme encoders has 4 layers and 8 heads, the same as the Conformer in the mel predictor. The convolutional kernel size of the convolutional layer in the Conformer layer is set to 31. Both SF-GAN and ML-GAN have 3 discriminators with the same structure: a 3-layer 2D convolutional network with a kernel size of 5. The weights of the three discriminators in SF-GAN are ρl = 0.5, ρm = 0.3, ρh = 0.2; the weights of the different losses are αp = 0.1, αf = 0.1. The learning rate is set to 0.0002 and decays at an exponential rate of 0.93 every 10k steps. The mini-batch size used for training in each step is 9.6k audio frames. FT-GAN is first trained for 100k steps on Hokkien speech data and then trained for 200k steps on Gezi Opera data. |
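The Pseudocode row quotes only the signature of Algorithm 1, not its body. A hypothetical sketch of what such a pseudo pitch extraction step might look like, assuming (this is a guess, not the paper's algorithm) that it discards unvoiced samples, requires at least T voiced samples, keeps the central ϵ fraction after sorting, and averages the retained points:

```python
def pseudo_pitch(points, eps=0.8, T=10):
    """Hypothetical sketch of Algorithm 1 (Pseudo Pitch Extraction).

    points: pitch sampling point sequence P = {p_k}_{k=1}^{s}
    eps:    retention ratio coefficient (fraction of central samples kept)
    T:      sample point threshold below which the segment is unvoiced

    Only the signature is quoted in the table; the body below is an
    illustrative assumption, not the authors' implementation.
    """
    # Drop unvoiced (zero) samples and sort the rest.
    voiced = sorted(p for p in points if p > 0)
    if len(voiced) < T:
        return 0.0  # too few voiced samples: treat segment as unvoiced
    # Trim (1 - eps)/2 of the samples from each tail to suppress outliers.
    drop = int(len(voiced) * (1 - eps) / 2)
    kept = voiced[drop:len(voiced) - drop] or voiced
    # Return the mean of the retained central samples as the pitch value.
    return sum(kept) / len(kept)
```

Trimming both tails before averaging is a common way to make a per-segment pitch estimate robust to octave errors at segment boundaries, which is plausibly what the retention ratio ϵ controls.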
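The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The values come directly from the quoted text; the key names are illustrative, not taken from the released code:

```python
# Hyperparameters as quoted in the Experiment Setup row.
# Key names are illustrative; values are from the paper's quoted setup.
FT_GAN_CONFIG = {
    "sample_rate": 24_000,            # Hz
    "fft_window_size": 512,
    "hop_size": 128,
    "mel_bins": 80,
    "hidden_size": 256,
    "encoder_fft_layers": 4,          # pitch/phoneme encoder FFT block
    "encoder_attention_heads": 8,
    "conformer_conv_kernel": 31,      # mel predictor Conformer layer
    "discriminators_per_gan": 3,      # SF-GAN and ML-GAN each
    "disc_conv_layers": 3,            # 2D conv layers per discriminator
    "disc_conv_kernel": 5,
    "sf_gan_disc_weights": {"rho_l": 0.5, "rho_m": 0.3, "rho_h": 0.2},
    "loss_weights": {"alpha_p": 0.1, "alpha_f": 0.1},
    "learning_rate": 2e-4,
    "lr_decay_rate": 0.93,            # exponential decay
    "lr_decay_interval_steps": 10_000,
    "batch_frames": 9_600,            # audio frames per mini-batch
    "pretrain_steps": 100_000,        # Hokkien speech data
    "finetune_steps": 200_000,        # Gezi Opera data
}
```

A flat dictionary like this makes it easy to verify that a reimplementation matches every reported value before attempting to reproduce the results.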