SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection
Authors: Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To alleviate these issues, we introduce a new ADD model that explicitly uses the Style-LInguistics Mismatch (SLIM) in fake speech to separate them from real speech... When the feature encoders are frozen, SLIM outperforms benchmark methods on out-of-domain datasets while achieving competitive results on in-domain data. |
| Researcher Affiliation | Industry | Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj (Reality Defender Inc.) {yi,surya,trang,gaurav}@realitydefender.ai |
| Pseudocode | Yes | Algorithm 1 shows a PyTorch-style implementation of the Stage 1 training objective... (an illustrative sketch of such an objective appears after this table) |
| Open Source Code | No | While the training script and model weights are not explicitly released, we provided sufficient details in Section 3 and Appendix A to faithfully reproduce and verify our results. |
| Open Datasets | Yes | Since only real samples are needed in Stage 1, we take advantage of open-source speech datasets by aggregating subsets from the Common Voice [3] and RAVDESS [41] as training data and use a small portion of real samples from the ASVspoof2019 LA train for validation. |
| Dataset Splits | Yes | For a fair comparison with existing works, we adopt the standard train-test partition, where only the ASVspoof2019 logical access (LA) training and development sets are used for training and validation. |
| Hardware Specification | Yes | Experiments were conducted on the Compute Canada cluster [9] with a total of four NVIDIA V100 GPUs (32GB RAM). |
| Software Dependencies | Yes | We implement our models using the SpeechBrain toolkit [62] v1.0.0. |
| Experiment Setup | Yes | Stage 1 training...Stage 2 training and evaluation...The hyperparameters used for Stage 1 and Stage 2 training are provided in Appendix A.7. Table 5: Hyperparameters and architecture details of SLIM. Stage 1 optimization: batch size 16; epochs 50; GPUs 4; audio length 5 s; optimizer AdamW; LR scheduler linear; starting LR 0.005; end LR 0.0001; early-stop patience 3 epochs; λ 0.007; training time 3 h. SSL frontend: style encoder Wav2vec-XLSR-SER (layers 0-10); linguistic encoder Wav2vec-XLSR-ASR (layers 14-21). Compression module: bottleneck layers 1; BN dropout 0.1; FC dropout 0.1; compression output dim 256. Stage 2 optimization: batch size 2; epochs 10; GPUs 4; audio length 5 s; optimizer AdamW; LR scheduler linear; starting LR 0.0001; end LR 0.00001; early-stop patience 3 epochs; training time 10 h. Classifier: FC dropout 0.25. Stage 2 data augmentation: num augmentations 1; concat with original True; augment prob 1; augment choices Noise, Reverb, SpecAug; SNR high 15 dB; SNR low 0 dB; reverb RIR noise; drop_freq_low 0; drop_freq_high 1; drop_freq_count_low 1; drop_freq_count_high 3; drop_freq_width 0.05; drop_chunk_count_low 1000; drop_chunk_length_high 2000. (The Table 5 values are restated as a config sketch, followed by an augmentation example, below.) |
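
The paper's Algorithm 1 (not reproduced in the excerpt above) gives a PyTorch-style implementation of the Stage 1 training objective. As a reading aid only, the following is a minimal sketch of one plausible form of that objective, assuming a cross-correlation alignment loss between the compressed style and linguistic embeddings of real speech, with the off-diagonal term weighted by λ = 0.007 from Table 5. The `Compressor` module and `stage1_loss` function are hypothetical names of ours, not the authors' code.

```python
# Illustrative sketch only, NOT the paper's Algorithm 1. Assumes a
# cross-correlation alignment objective between compressed style and
# linguistic embeddings of real speech; lam = 0.007 comes from Table 5.
import torch
import torch.nn as nn

class Compressor(nn.Module):
    """Hypothetical bottleneck: frozen SSL features -> 256-d embedding
    (1 bottleneck layer, dropout 0.1, output dim 256, per Table 5)."""
    def __init__(self, in_dim: int, out_dim: int = 256, p: float = 0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.ReLU(),
            nn.Dropout(p),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def stage1_loss(z_style: torch.Tensor,
                z_ling: torch.Tensor,
                lam: float = 0.007,
                eps: float = 1e-6) -> torch.Tensor:
    """Push the style/linguistics cross-correlation diagonal toward 1
    (agreement on real speech) and the off-diagonal entries toward 0."""
    n = z_style.size(0)
    # Standardize each embedding dimension over the batch.
    zs = (z_style - z_style.mean(0)) / (z_style.std(0) + eps)
    zl = (z_ling - z_ling.mean(0)) / (z_ling.std(0) + eps)
    c = zs.T @ zl / n                               # (dim, dim) correlation
    diag = torch.diagonal(c)
    on_diag = (diag - 1).pow(2).sum()               # invariance term
    off_diag = (c - torch.diag(diag)).pow(2).sum()  # redundancy term
    return on_diag + lam * off_diag
```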
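
To make the flattened Table 5 easier to scan, the reported values can be restated as a plain config. The key names below are our own; the values are taken verbatim from the table.

```python
# Table 5 hyperparameters restated as a plain config. Key names are ours;
# values are verbatim from the paper's Table 5.
SLIM_CONFIG = {
    "stage1": {
        "batch_size": 16, "epochs": 50, "gpus": 4, "audio_len_s": 5,
        "optimizer": "AdamW", "lr_scheduler": "linear",
        "lr_start": 5e-3, "lr_end": 1e-4,
        "early_stop_patience_epochs": 3, "lambda": 0.007,
        "training_time_h": 3,
    },
    "ssl_frontend": {
        "style_encoder": "Wav2vec-XLSR-SER", "style_layers": (0, 10),
        "linguistic_encoder": "Wav2vec-XLSR-ASR", "linguistic_layers": (14, 21),
    },
    "compression": {
        "bottleneck_layers": 1, "bn_dropout": 0.1,
        "fc_dropout": 0.1, "output_dim": 256,
    },
    "stage2": {
        "batch_size": 2, "epochs": 10, "gpus": 4, "audio_len_s": 5,
        "optimizer": "AdamW", "lr_scheduler": "linear",
        "lr_start": 1e-4, "lr_end": 1e-5,
        "early_stop_patience_epochs": 3, "training_time_h": 10,
        "classifier_fc_dropout": 0.25,
    },
}
```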
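
Since the paper reports SpeechBrain v1.0.0 as its toolkit, the Stage 2 augmentation settings map naturally onto SpeechBrain's time-domain augmentations (DropFreq/DropChunk are the toolkit's time-domain analogue of SpecAugment). The sketch below is an assumption of how those settings could be wired up, not the authors' script: class and argument names should be verified against the SpeechBrain docs, and the noise/RIR CSV manifests are hypothetical placeholders.

```python
# Sketch of the Stage 2 augmentation settings using SpeechBrain v1.0
# time-domain augmentations. Signatures should be checked against the
# toolkit docs; "noise_manifest.csv" and "rir_manifest.csv" are placeholders.
from speechbrain.augment.time_domain import AddNoise, AddReverb, DropChunk, DropFreq
from speechbrain.augment.augmenter import Augmenter

add_noise = AddNoise(csv_file="noise_manifest.csv", snr_low=0, snr_high=15)
add_reverb = AddReverb(csv_file="rir_manifest.csv")  # "Reverb: RIR noise"
drop_freq = DropFreq(
    drop_freq_low=0, drop_freq_high=1,
    drop_freq_count_low=1, drop_freq_count_high=3,
    drop_freq_width=0.05,
)
# Table 5's drop-chunk row looks extraction-garbled; only the 2000-sample
# upper bound is applied here, other arguments stay at toolkit defaults.
drop_chunk = DropChunk(drop_length_high=2000)

augmenter = Augmenter(
    concat_original=True,   # "Concat with original: True"
    min_augmentations=1,
    max_augmentations=1,    # "Num augmentations: 1"
    augment_prob=1.0,       # "Augment prob: 1"
    augmentations=[add_noise, add_reverb, drop_freq, drop_chunk],
)
# Usage on a batch of waveforms with relative lengths:
# wavs_aug, lens_aug = augmenter(wavs, lens)
```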