SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection

Authors: Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj

NeurIPS 2024

Reproducibility assessment (variable, result, and supporting quote from the paper):
Research Type: Experimental
Quote: "To alleviate these issues, we introduce a new ADD model that explicitly uses the Style-LInguistics Mismatch (SLIM) in fake speech to separate them from real speech... When the feature encoders are frozen, SLIM outperforms benchmark methods on out-of-domain datasets while achieving competitive results on in-domain data." (Section 4, Experiments)

Researcher Affiliation: Industry
Quote: "Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj, Reality Defender Inc. {yi,surya,trang,gaurav}@realitydefender.ai"

Pseudocode: Yes
Quote: "Algorithm 1 shows a PyTorch-style implementation of the Stage 1 training objective..."

Open Source Code: No
Quote: "While the training script and model weights are not explicitly released, we provided sufficient details in Section 3 and Appendix A to faithfully reproduce and verify our results."

Open Datasets: Yes
Quote: "Since only real samples are needed in Stage 1, we take advantage of open-source speech datasets by aggregating subsets from the Common Voice [3] and RAVDESS [41] as training data and use a small portion of real samples from the ASVspoof2019 LA train for validation."

Dataset Splits: Yes
Quote: "For a fair comparison with existing works, we adopt the standard train-test partition, where only the ASVspoof2019 logical access (LA) training and development sets are used for training and validation."

Hardware Specification: Yes
Quote: "Experiments were conducted on the Compute Canada cluster [9] with a total of four NVIDIA V100 GPUs (32GB RAM)."

Software Dependencies: Yes
Quote: "We implement our models using the SpeechBrain toolkit [62] v1.0.0."

Experiment Setup: Yes
Quote: "Stage 1 training... Stage 2 training and evaluation... The hyperparameters used for Stage 1 and Stage 2 training are provided in Appendix A.7."

Table 5: Hyperparameters and architecture details of SLIM.

Stage 1 optimization
  Batch size: 16
  Epochs: 50
  GPUs: 4
  Audio length: 5 s
  Optimizer: AdamW
  LR scheduler: Linear
  Starting LR: 0.005
  End LR: 0.0001
  Early-stop patience: 3 epochs
  λ: 0.007
  Training time: 3 h

SSL frontend
  Style encoder: Wav2vec-XLSR-SER
  Style layers: 0-10
  Linguistic encoder: Wav2vec-XLSR-ASR
  Linguistic layers: 14-21

Compression module
  Bottleneck layers: 1
  BN dropout: 0.1
  FC dropout: 0.1
  Compression output dim: 256

Stage 2 optimization
  Batch size: 2
  Epochs: 10
  GPUs: 4
  Audio length: 5 s
  Optimizer: AdamW
  LR scheduler: Linear
  Starting LR: 0.0001
  End LR: 0.00001
  Early-stop patience: 3 epochs
  Training time: 10 h

Classifier
  FC dropout: 0.25

Stage 2 data augmentation
  Num augmentations: 1
  Concat with original: True
  Augment prob: 1
  Augment choices: Noise, Reverb, SpecAug
  SNR_high: 15 dB
  SNR_low: 0 dB
  Reverb: RIR noise
  Drop_freq_low: 0
  Drop_freq_high: 1
  Drop_freq_count_low: 1
  Drop_freq_count_high: 3
  Drop_freq_width: 0.05
  Drop_chunk_count_low: 1000
  Drop_chunk_length_high: 2000
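The report quotes that Algorithm 1 in the paper gives a PyTorch-style implementation of the Stage 1 training objective, but the algorithm itself is not reproduced here. As a rough illustration of what aligning compressed style and linguistic features of real speech could look like, the sketch below computes a correlation-alignment loss between two feature matrices. Everything here is an assumption for illustration: the function name `mismatch_loss`, the cross-correlation form of the loss, and the reuse of the table's λ = 0.007 as an off-diagonal weight; the paper's Algorithm 1 is the authoritative definition.

```python
import numpy as np

def mismatch_loss(style: np.ndarray, ling: np.ndarray, lam: float = 0.007) -> float:
    """Hypothetical correlation-alignment loss between two (batch, dim) feature
    matrices. NOT the paper's Algorithm 1; a sketch of the general idea of
    penalizing style-linguistics mismatch on real speech."""
    # Standardize each feature dimension across the batch.
    s = (style - style.mean(axis=0)) / (style.std(axis=0) + 1e-8)
    z = (ling - ling.mean(axis=0)) / (ling.std(axis=0) + 1e-8)
    # Cross-correlation matrix between style and linguistic dimensions.
    c = s.T @ z / style.shape[0]
    # Pull matched dimensions toward perfect correlation ...
    on_diag = ((1.0 - np.diag(c)) ** 2).sum()
    # ... and decorrelate the remaining pairs, weighted by lam.
    off_diag = (c ** 2).sum() - (np.diag(c) ** 2).sum()
    return float(on_diag + lam * off_diag)
```

Under an objective of this shape, real-speech batches (where style and linguistic representations co-vary) yield a low loss, while mismatched pairs yield a high one; the detector then exploits that gap on fake speech.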
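Table 5 lists a linear LR scheduler for both stages (Stage 1: 0.005 to 0.0001 over 50 epochs; Stage 2: 0.0001 to 0.00001 over 10 epochs). A minimal sketch of that decay is below; the table does not say whether the schedule steps per optimizer update or per epoch, so per-epoch interpolation is assumed, and the function name `linear_lr` is illustrative.

```python
def linear_lr(epoch: int, num_epochs: int = 50,
              start_lr: float = 0.005, end_lr: float = 0.0001) -> float:
    """Linearly interpolate the learning rate from start_lr (epoch 0)
    to end_lr (last epoch). Defaults match Table 5's Stage 1 values."""
    frac = epoch / (num_epochs - 1)  # 0.0 at the first epoch, 1.0 at the last
    return start_lr + frac * (end_lr - start_lr)
```

For Stage 2, the same function would be called with `num_epochs=10`, `start_lr=1e-4`, `end_lr=1e-5`.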