Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Phoneme-Level Feature Discrepancies: A Key to Detecting Sophisticated Speech Deepfakes

Authors: Kuiyuan Zhang, Zhongyun Hua, Rushi Lan, Yushu Zhang, Yifang Guo

AAAI 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type: Experimental. Extensive experiments on four benchmark datasets demonstrate the superior performance of our method over existing state-of-the-art detection methods.
Researcher Affiliation: Collaboration. 1 School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China; 2 School of Computer Science and Information Security, Guilin University of Electronic Technology, China; 3 School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics, China; 4 Alibaba Group, China.
Pseudocode: No. The paper describes the adaptive phoneme pooling process with a figure (Figure 2) and explains techniques like the Graph Attention Module (GAT) and Random Phoneme Substitution Augmentation (RPSA) in paragraph text, but it does not present them within a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code: No. The paper does not provide any explicit statement about releasing source code for the described methodology, nor does it include a link to a code repository. An 'Extended version' link is provided, but it points to an arXiv paper, not code.
Open Datasets: Yes. We utilize the ASVspoof2019 (Wang et al. 2020), ASVspoof2021 (Liu et al. 2023), MLAAD (Müller et al. 2024), and In The Wild (Müller et al. 2022) datasets to evaluate our model. ... We adopt the multi-language Common Voice 6.1 corpus to train our phoneme recognition model. ... We specifically utilize the Musan (Snyder, Chen, and Povey 2015) dataset, which provides a broad range of technical and non-technical noises.
Dataset Splits: No. For all samples, we randomly clip a three-second clip during the training process and clip the middle three seconds for validation and testing. ... Specifically, we train and validate all the methods on the training and validation subsets and split the testing subset into the seen and unseen synthesizer parts. ... In this evaluation task, we train and validate all the detection models on the EN, DE, and ES subsets of the MLAAD dataset and test them on the In The Wild dataset and the remaining languages of the MLAAD dataset. The paper describes how samples are processed (clipped) and which subsets of benchmark datasets are used for training, validation, and testing, but it does not provide specific percentages, absolute sample counts, or explicit citations for how these datasets are formally split into training, validation, and test sets.
Hardware Specification: Yes. All tests were carried out on a computer equipped with a GTX 4090 GPU, using the PyTorch programming framework.
Software Dependencies: No. All tests were carried out on a computer equipped with a GTX 4090 GPU, using the PyTorch programming framework. The paper mentions PyTorch as the programming framework but does not specify its version number or any other software dependencies with their respective versions.
Experiment Setup: Yes. Implementation Details: We utilize WavLM as the backbone of our phoneme recognition model. The number of edges in the GAT is set to 10. The substitution probability p in RPSA is set to 0.2. We train our detection model using the AdamW optimizer (Loshchilov and Hutter 2019), where the learning rate of the copied Transformer is set to 5e-5 and that of other learnable parameters is set to 1e-4. We introduce two data augmentation strategies and an early-stopping technique for every detection approach. Specifically, the data augmentation involves adding random Gaussian noise and applying random pitch adjustments to the audio samples. The early-stopping technique terminates the training of a model if there is no improvement in the area under the Receiver Operating Characteristic (ROC) curve (AUC) score after three training epochs.
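The clipping protocol quoted under Dataset Splits (a random three-second clip for training, the middle three seconds for validation and testing) can be sketched as below. This is a minimal illustration, not the authors' code: the 16 kHz sample rate and the repetition padding for clips shorter than three seconds are assumptions the paper does not state.

```python
import random

SAMPLE_RATE = 16000            # assumed sample rate; not stated in the paper
CLIP_SECONDS = 3
CLIP_LEN = SAMPLE_RATE * CLIP_SECONDS

def clip_waveform(waveform, split, rng=random):
    """Return a three-second clip: random position for 'train',
    the middle three seconds for validation/testing."""
    n = len(waveform)
    if n <= CLIP_LEN:
        # Pad short samples by repetition (an assumption, for illustration only).
        reps = -(-CLIP_LEN // n)  # ceiling division
        return (list(waveform) * reps)[:CLIP_LEN]
    if split == "train":
        start = rng.randrange(0, n - CLIP_LEN + 1)
    else:  # 'val' or 'test': middle three seconds
        start = (n - CLIP_LEN) // 2
    return waveform[start:start + CLIP_LEN]
```

For a five-second waveform, the validation/test clip starts one second in, so training and evaluation see deterministic inputs at test time while training still benefits from random cropping.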
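The early-stopping rule in the Experiment Setup row (stop when the validation AUC shows no improvement for three epochs) can be sketched as a small tracker. The class name, the default patience of 3, and the choice that a tied score counts as "no improvement" are assumptions for illustration; the paper gives only the rule in prose.

```python
class AUCEarlyStopper:
    """Track validation AUC and signal when training should stop."""

    def __init__(self, patience=3):
        self.patience = patience
        self.best_auc = float("-inf")
        self.bad_epochs = 0

    def step(self, val_auc):
        """Record one epoch's validation AUC; return True to stop training."""
        if val_auc > self.best_auc:
            self.best_auc = val_auc
            self.bad_epochs = 0
        else:
            # Equal or worse AUC counts as no improvement (an assumption).
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In a training loop this would be called once per epoch after computing the validation AUC; the two learning rates mentioned above (5e-5 for the copied Transformer, 1e-4 elsewhere) would correspond to two AdamW parameter groups, which this sketch does not reproduce.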