Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Exploring Unbiased Deepfake Detection via Token-Level Shuffling and Mixing
Authors: Xinghe Fu, Zhiyuan Yan, Taiping Yao, Shen Chen, Xi Li
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate and verify the effectiveness of our method through extensive experiments on widely used evaluation datasets. ... We report both frame-level and video-level AUC scores in Table 1 and compare our methods with previous state-of-the-art methods... Ablation Study. ... Figure 4: The results of robustness evaluation on the test set of FF++ (c23). Video-level AUC (%) is reported under five different types of perturbations following (Jiang et al. 2020). |
| Researcher Affiliation | Collaboration | 1College of Computer Science and Technology, Zhejiang University 2Youtu Lab, Tencent EMAIL, EMAIL, EMAIL |
| Pseudocode | No | The paper describes the methodology using textual explanations and diagrams, such as Figure 2 depicting the overall pipeline. However, it does not include any explicit pseudocode blocks or algorithms with structured steps. |
| Open Source Code | No | The paper does not contain any statements or links indicating the release of source code for the described methodology. Phrases like 'We release our code...' or direct repository links are absent. |
| Open Datasets | Yes | For comprehensively assess the proposed method, we utilize five widely-used public datasets Face Forensics++ (FF++) (Rossler et al. 2019), Celeb-DF (CDF) (Li et al. 2020b), DFDC (Dolhansky et al. 2020), DFDCPreview (DFDCP) (Dolhansky et al. 2019), and DFD (Deepfakedetection 2021) in our experiments, following previous works (Mohseni et al. 2020; Yan et al. 2023c). |
| Dataset Splits | Yes | We use the training split of FF++ (Rossler et al. 2019) as the training set. For the cross-dataset evaluation, we test our model on datasets other than FF++. For the robustness evaluation, we test our model on the test split of FF++. |
| Hardware Specification | No | The paper mentions using 'ViT-B model as the backbone network of detectors' and initializes it with 'pre-trained weights from the vision encoder of CLIP', but it does not specify any hardware details like GPU models, CPU types, or memory specifications used for training or inference. |
| Software Dependencies | No | The paper does not provide specific version numbers for any software components, libraries, or frameworks used in the implementation, such as Python, PyTorch, or TensorFlow versions. |
| Experiment Setup | Yes | We utilize the ViT-B model as the backbone network of detectors. The backbone is initialized with the pre-trained weights from the vision encoder of CLIP (Radford et al. 2021) by default. We evenly sample 8 frames (Mohseni et al. 2020) from the training videos of FF++ (c23) to form the training set. For S-Branch, we divide the token map into 2 × 2 blocks in shuffling. For M-Branch, the mixing ratio r is set to 0.3. The rank for W is set to 4. The hyperparameters τ, λ1, and λ2 in loss functions are set to 0.1, 0.1, and 0.1 in the training. |
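The setup row above describes the two augmentation branches in enough detail to sketch them: S-Branch shuffles tokens within local 2×2 windows of the token map, and M-Branch mixes a fraction r = 0.3 of one sample's tokens with another's. The sketch below is an illustration only, not the authors' implementation: the function names (`shuffle_tokens_in_blocks`, `mix_tokens`) are hypothetical, and the exact shuffling granularity and mixing rule in the paper may differ (e.g. the mixing mask could be block-wise rather than per-token).

```python
import numpy as np


def shuffle_tokens_in_blocks(tokens, block=2, rng=None):
    """S-Branch sketch: independently permute the tokens inside each
    block x block window of an (H, W, D) token map."""
    rng = rng or np.random.default_rng(0)
    H, W, D = tokens.shape
    out = tokens.copy()
    for i in range(0, H, block):
        for j in range(0, W, block):
            # Flatten the window to a list of tokens, permute, restore.
            patch = out[i:i + block, j:j + block].reshape(-1, D)
            perm = rng.permutation(len(patch))
            out[i:i + block, j:j + block] = patch[perm].reshape(block, block, D)
    return out


def mix_tokens(tokens_a, tokens_b, r=0.3, rng=None):
    """M-Branch sketch: replace a random fraction r of tokens in sample A
    with the spatially corresponding tokens from sample B."""
    rng = rng or np.random.default_rng(0)
    H, W, _ = tokens_a.shape
    mask = rng.random((H, W)) < r  # True -> take this token from B
    return np.where(mask[..., None], tokens_b, tokens_a)
```

Both operations preserve the token-map shape, so the augmented maps can be fed to the ViT-B backbone unchanged; shuffling only rearranges tokens locally (the global token multiset is preserved), while mixing produces a map whose every token comes verbatim from one of the two inputs.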