Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

SLIM: Style-Linguistics Mismatch Model for Generalized Audio Deepfake Detection

Authors: Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To alleviate these issues, we introduce a new ADD model that explicitly uses the Style-LInguistics Mismatch (SLIM) in fake speech to separate them from real speech... When the feature encoders are frozen, SLIM outperforms benchmark methods on out-of-domain datasets while achieving competitive results on in-domain data. 4 Experiments
Researcher Affiliation | Industry | Yi Zhu, Surya Koppisetti, Trang Tran, Gaurav Bharaj; Reality Defender Inc. EMAIL
Pseudocode | Yes | Algorithm 1 shows a PyTorch-style implementation of the Stage 1 training objective...
Open Source Code | No | While the training script and model weights are not explicitly released, we provided sufficient details in Section 3 and Appendix A to faithfully reproduce and verify our results.
Open Datasets | Yes | Since only real samples are needed in Stage 1, we take advantage of open-source speech datasets by aggregating subsets from the Common Voice [3] and RAVDESS [41] as training data and use a small portion of real samples from the ASVspoof2019 LA train for validation.
Dataset Splits | Yes | For a fair comparison with existing works, we adopt the standard train-test partition, where only the ASVspoof2019 logical access (LA) training and development sets are used for training and validation.
Hardware Specification | Yes | Experiments were conducted on the Compute Canada cluster [9] with a total of four NVIDIA V100 GPUs (32GB RAM).
Software Dependencies | Yes | We implement our models using the SpeechBrain toolkit [62] v1.0.0.
Experiment Setup | Yes | Stage 1 training... Stage 2 training and evaluation... The hyperparameters used for Stage 1 and Stage 2 training are provided in Appendix A.7. Table 5: Hyperparameters and architecture details of SLIM (columns: Parameter, SLIM). Stage 1 Optimization: Batch size 16; Epochs 50; GPUs 4; Audio length 5s; Optimizer AdamW; LR scheduler Linear; Starting LR .005; End LR .0001; Early-stop patience 3 epochs; λ .007; Training time 3h. SSL frontend: Style encoder Wav2vec-XLSR-SER; Style layers 0-10; Linguistic encoder Wav2vec-XLSR-ASR; Linguistic layers 14-21. Compression module: Bottleneck layers 1; BN dropout 0.1; FC dropout 0.1; Compression output dim 256. Stage 2 Optimization: Batch size 2; Epochs 10; GPUs 4; Audio length 5s; Optimizer AdamW; LR scheduler Linear; Starting LR .0001; End LR .00001; Early-stop patience 3 epochs; Training time 10h. Classifier: FC dropout 0.25. Stage 2 data augmentation: Num augmentations 1; Concat with original True; Augment prob 1; Augment choices Noise, Reverb, SpecAug; SNR_high 15 dB; SNR_low 0 dB; Reverb RIR noise; Drop_freq_low 0; Drop_Freq_high 1; Drop_freq_count_low 1; Drop_freq_count_high 3; Drop_freq_width .05; Drop_chunk_count_low 1000; Drop_chunk_length_high 2000.
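
Since the training script is not released, the optimization settings quoted from Table 5 may be easier to reuse when gathered in one place. The following is a minimal sketch, not the authors' code: the class name, field names, and the `linear_lr` helper are hypothetical, and it assumes the "Linear" scheduler decays the learning rate in a straight line per epoch from the starting LR to the end LR. The source does not say whether the batch size is per GPU or total, so the value is recorded as given.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Optimization settings copied from Table 5; field names are hypothetical."""
    batch_size: int           # as listed; per-GPU vs. total is not stated in the source
    epochs: int
    gpus: int
    audio_length_s: float
    optimizer: str
    start_lr: float
    end_lr: float
    early_stop_patience: int  # in epochs


# Values taken from the Stage 1 and Stage 2 "Optimization" entries of Table 5.
STAGE1 = StageConfig(16, 50, 4, 5.0, "AdamW", 5e-3, 1e-4, 3)
STAGE2 = StageConfig(2, 10, 4, 5.0, "AdamW", 1e-4, 1e-5, 3)


def linear_lr(cfg: StageConfig, epoch: int) -> float:
    """Assumed schedule: straight-line decay from start_lr (epoch 0) to end_lr (last epoch)."""
    if cfg.epochs <= 1:
        return cfg.end_lr
    last = cfg.epochs - 1
    frac = min(max(epoch, 0), last) / last  # clamp epoch into [0, last]
    return cfg.start_lr + frac * (cfg.end_lr - cfg.start_lr)


if __name__ == "__main__":
    for name, cfg in (("Stage 1", STAGE1), ("Stage 2", STAGE2)):
        probes = (0, cfg.epochs // 2, cfg.epochs - 1)
        print(name, [f"{linear_lr(cfg, e):.2e}" for e in probes])
```

Under this assumption the helper returns the starting LR at epoch 0 and the end LR at the final epoch; early stopping with a patience of 3 epochs may end training before the final value is reached.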