Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Fair Deepfake Detectors Can Generalize

Authors: Harry Cheng, Ming-Hui Liu, Yangyang Guo, Tianyi Wang, Liqiang Nie, Mohan Kankanhalli

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We conduct extensive experiments across multiple datasets and different backbones. The results demonstrate that our approach leads to improvements in both fairness and generalization. For instance, on the DFDC [14], DFD [1], and Celeb-DF [32] datasets, our method outperforms several the state-of-the-art (SoTA) approaches.
Researcher Affiliation	Academia	Harry Cheng National University of Singapore EMAIL Ming-Hui Liu Shandong University EMAIL Yangyang Guo National University of Singapore EMAIL Tianyi Wang National University of Singapore EMAIL Liqiang Nie Harbin Institute of Technology (Shenzhen) EMAIL Mohan Kankanhalli National University of Singapore EMAIL
Pseudocode	Yes	Algorithm 1: Skew Computation for Each Demographic Group. Require: ground-truth labels $y$, predicted probabilities $\hat{p}$, binary predictions $\hat{y}$, demographic attributes gender, race. 1: Convert predictions: $\hat{y}_i = \mathbb{I}[\hat{p}_i > 0.5]$ 2: for each group $s \in S$ do 3: for each class $c \in \{\text{real}, \text{fake}\}$ do 4: Compute: $P(y = c \mid s)$ from ground-truth labels 5: Compute: $P(\hat{y} = c \mid s)$ from predicted labels 6: Compute: $\mathrm{Skew}(s, c) = \log \frac{P(\hat{y} = c \mid s)}{P(y = c \mid s)}$ 7: end for 8: end for 9: Collect: maxskew, minskew across all $s$, $c$
Open Source Code	Yes	Our contributions are threefold: ... We evaluate our approach on multiple datasets and backbones, showing consistent improvements in fairness and generalization. Code is provided in the supplementary materials. ... Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: The code will be uploaded as supplementary material.
Open Datasets	Yes	Following prior work [72, 59, 58], we employed FaceForensics++ (FF++) as the training set and evaluate the generalization performance on three other datasets: DFDC [14], DFD [1], and Celeb-DF [32]. Since none of these datasets contain native demographic annotations, we follow the data processing, annotation protocol, and sensitive attribute intersection strategy of previous fairness studies [33, 70, 23].
Dataset Splits	Yes	Following prior work [72, 59, 58], we employed FaceForensics++ (FF++) as the training set and evaluate the generalization performance on three other datasets: DFDC [14], DFD [1], and Celeb-DF [32]. ... In our training-testing pipeline, the entire training set is used for model training, following standard practices [63, 39]. The testing set, on the other hand, is stratified based on the intersection of gender and race.
Hardware Specification	Yes	All experiments are conducted on a single NVIDIA H100 GPU.
Software Dependencies	No	Training employs AdamW (learning rate $1 \times 10^{-3}$, weight decay $4 \times 10^{-3}$) until convergence, with a batch size of 64. All input images are resized to $224 \times 224$ and normalized using ImageNet statistics.
Experiment Setup	Yes	Training employs AdamW (learning rate $1 \times 10^{-3}$, weight decay $4 \times 10^{-3}$) until convergence, with a batch size of 64. All input images are resized to $224 \times 224$ and normalized using ImageNet statistics. We employ two hyperparameters, $\lambda_{\text{attr}}$ and $\lambda_{\text{ortho}}$, to control the relative weights of the corresponding loss functions. To investigate their impact on model generalization, we conducted a parameter sensitivity analysis, with the results shown in Figure 3. As both parameters increase, model performance initially improves and then stabilizes. Based on empirical observations, we select $\lambda_{\text{attr}} = 0.7$ and $\lambda_{\text{ortho}} = 0.2$ as default values.
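
The skew computation quoted in the Pseudocode row can be sketched in Python as follows. This is an illustrative reading of Algorithm 1, not the authors' released code. The function name `compute_skew`, the 0/1 label encoding (1 = fake), the representation of each group as a single hashable id, and the decision to skip group/class pairs whose probability is zero are all assumptions made for this sketch.

```python
import math
from collections import defaultdict


def compute_skew(labels, probs, groups, threshold=0.5):
    """Return ({(group, cls): skew}, max_skew, min_skew).

    labels: ground-truth classes, 1 = fake, 0 = real (assumed encoding).
    probs:  predicted probability of the fake class, one per sample.
    groups: one hashable group id per sample, e.g. ("female", "asian").
    """
    # Step 1: binarize the predicted probabilities.
    preds = [1 if p > threshold else 0 for p in probs]

    # Collect the sample indices that belong to each demographic group.
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)

    skews = {}
    for g, idx in members.items():          # Step 2: each group s
        n = len(idx)
        for cls in (0, 1):                  # Step 3: each class c
            p_true = sum(labels[i] == cls for i in idx) / n  # P(y = c | s)
            p_pred = sum(preds[i] == cls for i in idx) / n   # P(y_hat = c | s)
            # Assumption: the log ratio is undefined when either
            # probability is zero, so such pairs are skipped here.
            if p_true == 0 or p_pred == 0:
                continue
            skews[(g, cls)] = math.log(p_pred / p_true)      # Step 6

    if not skews:
        raise ValueError("no group/class pair has a defined skew")
    values = list(skews.values())
    return skews, max(values), min(values)                   # Step 9


# Toy data (made up for illustration): two groups of four samples each.
labels = [1, 1, 0, 0, 1, 0, 0, 0]
probs = [0.9, 0.8, 0.7, 0.2, 0.6, 0.4, 0.3, 0.1]
groups = ["a"] * 4 + ["b"] * 4
skews, max_skew, min_skew = compute_skew(labels, probs, groups)
```

In this toy example, group "a" is over-predicted as fake (3 of 4 predictions against 2 of 4 labels), which gives a positive skew for the fake class and a negative skew for the real class. Group "b" is predicted in the same proportions as its labels, so both of its skews are zero.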