Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection
Authors: Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To alleviate these issues, we introduce a new ADD model that explicitly uses the Style-LInguistics Mismatch (SLIM) in fake speech to separate them from real speech... When the feature encoders are frozen, SLIM outperforms benchmark methods on out-of-domain datasets while achieving competitive results on in-domain data. Section 4: Experiments |
| Researcher Affiliation | Industry | Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj. Reality Defender Inc. EMAIL |
| Pseudocode | Yes | Algorithm 1 shows a PyTorch-style implementation of the Stage 1 training objective... |
| Open Source Code | No | While the training script and model weights are not explicitly released, we provided sufficient details in Section 3 and Appendix A to faithfully reproduce and verify our results. |
| Open Datasets | Yes | Since only real samples are needed in Stage 1, we take advantage of open-source speech datasets by aggregating subsets from the Common Voice [3] and RAVDESS [41] as training data and use a small portion of real samples from the ASVspoof2019 LA train for validation. |
| Dataset Splits | Yes | For a fair comparison with existing works, we adopt the standard train-test partition, where only the ASVspoof2019 logical access (LA) training and development sets are used for training and validation. |
| Hardware Specification | Yes | Experiments were conducted on the Compute Canada cluster [9] with a total of four NVIDIA V100 GPUs (32GB RAM). |
| Software Dependencies | Yes | We implement our models using the SpeechBrain toolkit [62] v1.0.0. |
| Experiment Setup | Yes | Stage 1 training... Stage 2 training and evaluation... The hyperparameters used for Stage 1 and Stage 2 training are provided in Appendix A.7. Table 5: Hyperparameters and architecture details of SLIM. Stage 1 optimization: batch size 16; epochs 50; GPUs 4; audio length 5 s; optimizer AdamW; LR scheduler linear; starting LR .005; end LR .0001; early-stop patience 3 epochs; λ .007; training time 3 h. SSL frontend: style encoder Wav2vec-XLSR-SER; style layers 0-10; linguistic encoder Wav2vec-XLSR-ASR; linguistic layers 14-21. Compression module: bottleneck layers 1; BN dropout 0.1; FC dropout 0.1; compression output dim 256. Stage 2 optimization: batch size 2; epochs 10; GPUs 4; audio length 5 s; optimizer AdamW; LR scheduler linear; starting LR .0001; end LR .00001; early-stop patience 3 epochs; training time 10 h. Classifier: FC dropout 0.25. Stage 2 data augmentation: num augmentations 1; concat with original True; augment prob 1; augment choices Noise, Reverb, SpecAug; SNR_high 15 dB; SNR_low 0 dB; Reverb RIR noise; Drop_freq_low 0; Drop_Freq_high 1; Drop_freq_count_low 1; Drop_freq_count_high 3; Drop_freq_width .05; Drop_chunk_count_low 1000; Drop_chunk_length_high 2000. |
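
The optimization settings quoted from Table 5 can be collected into a small configuration object, which makes the two training stages easier to compare. The following is a minimal sketch, not the authors' code (the table above reports that no training script was released). The names `StageConfig`, `STAGE1`, `STAGE2`, and `linear_lr` are hypothetical, and stepping the linear schedule once per epoch is an assumption, since the source states only "Linear" with a starting and an end learning rate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Optimization settings for one training stage (values from Table 5)."""
    batch_size: int
    epochs: int
    start_lr: float
    end_lr: float
    patience: int = 3  # early-stop patience, in epochs


# Values copied from Table 5; the field names are illustrative only.
STAGE1 = StageConfig(batch_size=16, epochs=50, start_lr=0.005, end_lr=0.0001)
STAGE2 = StageConfig(batch_size=2, epochs=10, start_lr=0.0001, end_lr=0.00001)


def linear_lr(cfg: StageConfig, epoch: int) -> float:
    """Linearly interpolate the learning rate between start_lr and end_lr.

    Assumption: the rate is updated once per epoch, reaching end_lr at the
    final epoch. The source does not state the update granularity.
    """
    if not 0 <= epoch < cfg.epochs:
        raise ValueError(f"epoch must be in [0, {cfg.epochs}), got {epoch}")
    if cfg.epochs == 1:
        return cfg.start_lr
    frac = epoch / (cfg.epochs - 1)
    return cfg.start_lr + frac * (cfg.end_lr - cfg.start_lr)


if __name__ == "__main__":
    for name, cfg in (("Stage 1", STAGE1), ("Stage 2", STAGE2)):
        first = linear_lr(cfg, 0)
        last = linear_lr(cfg, cfg.epochs - 1)
        print(f"{name}: LR {first:g} -> {last:g} over {cfg.epochs} epochs")
```

Under these assumptions, Stage 1 anneals the learning rate by a factor of 50 over its 50 epochs, while Stage 2 starts where Stage 1 ends and anneals by a further factor of 10. Early stopping with a patience of 3 epochs may end either stage before the final rate is reached.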