Speech-Forensics: Towards Comprehensive Synthetic Speech Dataset Establishment and Analysis

Authors: Zhoulin Ji, Chenhao Lin, Hang Wang, Chao Shen

IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our model achieves an average mAP of 83.55% and an EER of 5.25% at the utterance level. At the segment level, it attains an EER of 1.07% and a 92.19% F1 score. These results highlight the model's robust capability for a comprehensive analysis of synthetic speech, offering a promising avenue for future research and practical applications in this field.
Researcher Affiliation | Academia | Zhoulin Ji (1), Chenhao Lin (2), Hang Wang (3,4) and Chao Shen (2); (1) School of Software Engineering, Xi'an Jiaotong University; (2) School of Cyber Science and Engineering, Xi'an Jiaotong University; (3) School of Automation Science and Engineering, Xi'an Jiaotong University; (4) Department of Computing, The Hong Kong Polytechnic University
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The dataset and code are available at https://github.com/ring-zl/Speech-Forensics.
Open Datasets | Yes | We utilized the LJ Speech dataset, a public domain collection of 13,100 audio clips with transcripts, read by a single speaker. (https://keithito.com/LJ-Speech-Dataset/)
Dataset Splits | No | The paper mentions 'mini-batch training' and that 'training included a warmup phase' but does not provide specific details on training/validation/test dataset splits (e.g., percentages, sample counts, or explicit standard split references).
Hardware Specification | Yes | Our experiments were conducted on a NVIDIA 4090 GPU.
Software Dependencies | No | The paper mentions the use of the 'AdamW optimizer' but does not specify software dependencies with version numbers for reproducibility (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We adopted mini-batch training using the AdamW [Loshchilov and Hutter, 2017] optimizer. The training included a warmup phase of 5 epochs, followed by a cosine decay schedule for the learning rate. The initial learning rate was set at 1e-3, with a weight decay of 1e-3.
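
The Experiment Setup row is the one part of the paper that fully specifies optimizer and schedule, so a minimal sketch is possible. The snippet below assumes a PyTorch implementation (the paper does not name a framework); the model, total epoch count, and training loop are placeholders, while the AdamW settings, 5-epoch warmup, and cosine decay follow the quoted values.

```python
# Minimal sketch of the reported training configuration (PyTorch assumed).
# Only the optimizer, warmup length, LR, weight decay, and cosine decay come
# from the paper; everything else here is a placeholder.
import math
import torch

model = torch.nn.Linear(128, 2)     # placeholder for the Speech-Forensics model
total_epochs = 100                  # assumption: total epoch count is not stated
warmup_epochs = 5                   # reported warmup phase
base_lr, weight_decay = 1e-3, 1e-3  # reported initial LR and weight decay

optimizer = torch.optim.AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)

def lr_lambda(epoch: int) -> float:
    """Linear warmup for the first 5 epochs, then cosine decay toward zero."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(total_epochs):
    # ... mini-batch training over the dataset would go here ...
    optimizer.step()   # placeholder; real code steps once per mini-batch
    scheduler.step()   # advance the warmup + cosine schedule once per epoch
```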
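For context on the metrics quoted in the Research Type row, the sketch below shows the conventional ROC-based computation of the equal error rate (EER). This is the standard definition, not code from the paper, and the labels/scores are synthetic placeholders; it requires numpy and scikit-learn.

```python
# Standard ROC-based EER computation (not taken from the paper's code).
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels: np.ndarray, scores: np.ndarray) -> float:
    """EER = operating point where the false-accept and false-reject rates meet."""
    fpr, tpr, _ = roc_curve(labels, scores)   # labels: 1 = synthetic, 0 = genuine
    fnr = 1.0 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))     # threshold closest to FAR == FRR
    return float((fpr[idx] + fnr[idx]) / 2.0)

# Toy usage with random scores (illustrative only).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=1000)
scores = labels + rng.normal(scale=0.8, size=1000)
print(f"EER: {equal_error_rate(labels, scores):.4f}")
```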