DF40: Toward Next-Generation Deepfake Detection

Zhiyuan Yan

Taiping Yao

Shen Chen

Yandan Zhao

Xinghe Fu

Junwei Zhu

Donghao Luo

Chengjie Wang

Shouhong Ding

Yunsheng Wu

Li Yuan

[ArXiv]

[GitHub]

[Dataset]

[Checkpoints]

Is it possible to detect the various types of AI-generated faces (e.g., face-swapping, talking-head, AIGC, etc.)? This work proposes a comprehensive dataset called DF40, which comprises 40 distinct synthesis techniques, including 10 face-swapping (FS), 13 face-reenactment (FR), 12 entire face synthesis (EFS), and 5 face editing (FE) methods. We then conduct more than 2,000 evaluations on a standard benchmark, leading to several new findings with insightful analysis.

Why use DF40? (highlights)

- Realism and diversity: DF40 contains the latest and most realistic synthesis techniques of various types, such as HeyGen (FR), DeepFaceLab (FS), MidJourney-v6 (EFS), Collaborative-Diffusion (FE), etc.
- Multiple scenarios: DF40 contains 31 known/white-box synthesis methods (both the real data and the fake methods are known) and 9 unknown/black-box methods (either the real data or the fake method is unknown).
- Aligned data domain: the proposed 31 known/white-box methods are applied to the real data from the widely used FF++ and Celeb-DF datasets, meaning our generated fakes and their original fakes are from the same data domain.
- Comprehensive benchmarking: Our benchmark conducts more than 2,000 evaluations with 4 standard evaluation protocols, leading to several new findings with insightful analysis.

Why do this? (motivation)

In this work, we aim to address the following challenges in current deepfake detection research (especially in its datasets):
(1) Forgery Diversity: The term "deepfake" commonly covers both face forgery (face-swapping and face-reenactment) and entire image synthesis (AIGC, especially of faces). Most existing datasets contain only some of these types, with a limited number of forgery methods implemented (e.g., 2 swapping and 2 reenactment methods in FF++);
(2) Forgery Realism: The dominant training datasets, such as FF++, contain out-of-date forgery techniques from the past four years. "Honing skills" on these forgeries makes it difficult to guarantee effective detection generalization to today's SoTA deepfakes;
(3) Evaluation Protocol: Most detection works perform evaluations on one type only, e.g., face-swapping, which hinders the development of universal deepfake detectors.
To address these challenges, we construct a highly diverse and large-scale deepfake detection dataset called DF40, which comprises 40 distinct deepfake techniques and offers realism, diversity, and comprehensiveness. We also open up several valuable yet previously underexplored research questions to inspire future work.

Main Results

We conduct our main evaluations using four types of protocols: Cross-forgery evaluation (Protocol-1), Cross-domain evaluation (Protocol-2), Toward unknown forgery and domain evaluation (Protocol-3), and One-Versus-All (OvA) evaluation (Protocol-4).
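As a rough illustration of what a cross-forgery evaluation (Protocol-1) can look like in code, the sketch below scores each detector on every forgery type and collects the results in a train-by-test matrix. It is only a minimal sketch under stated assumptions: the use of AUC as the metric, the helper names `auc` and `cross_forgery_matrix`, and the toy "detectors" and samples are all hypothetical and are not taken from the DF40 code base.

```python
from itertools import product


def auc(labels, scores):
    """Rank-based AUC: chance that a random fake outscores a random real.

    Ties between a fake and a real score count as half a win.
    """
    fake_scores = [s for y, s in zip(labels, scores) if y == 1]
    real_scores = [s for y, s in zip(labels, scores) if y == 0]
    if not fake_scores or not real_scores:
        raise ValueError("need at least one real and one fake sample")
    wins = sum(
        1.0 if f > r else 0.5 if f == r else 0.0
        for f, r in product(fake_scores, real_scores)
    )
    return wins / (len(fake_scores) * len(real_scores))


def cross_forgery_matrix(detectors, test_sets):
    """Evaluate every detector on every test set.

    detectors: {train_type: callable(sample) -> fake score}
    test_sets: {test_type: list of (sample, label)}, label 1 = fake, 0 = real
    Returns {(train_type, test_type): AUC}.
    """
    results = {}
    for train_type, score_fn in detectors.items():
        for test_type, samples in test_sets.items():
            labels = [y for _, y in samples]
            scores = [score_fn(x) for x, _ in samples]
            results[(train_type, test_type)] = auc(labels, scores)
    return results


# Toy stand-ins (assumptions): each sample is a dict of made-up artifact
# strengths, and each "detector" simply reads one of those values.
test_sets = {
    "FS": [({"blend": 0.9, "synth": 0.1}, 1), ({"blend": 0.1, "synth": 0.1}, 0)],
    "EFS": [({"blend": 0.0, "synth": 0.8}, 1), ({"blend": 0.1, "synth": 0.2}, 0)],
}
detectors = {
    "FS": lambda x: x["blend"],   # hypothetical detector trained on FS fakes
    "EFS": lambda x: x["synth"],  # hypothetical detector trained on EFS fakes
}

matrix = cross_forgery_matrix(detectors, test_sets)
for (train_type, test_type), value in sorted(matrix.items()):
    print(f"train={train_type:<3} test={test_type:<3} AUC={value:.2f}")
```

In this toy setup the diagonal entries (same forgery type for training and testing) are perfect while the off-diagonal entries are not, which is the kind of gap a cross-forgery protocol is meant to expose.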
(Results-1) The results of Protocol-1:

(Results-2) The results of Protocol-2:

(Results-3) The results of Protocol-3:

(Results-4) The results of Protocol-4:

Additional Results & Analysis

We also provide the following additional results to enlarge our evaluation scope.
(Results-5) Train on the DF40 "known" methods and test on "unknown" methods of DF40:

(Results-6) Train on DF40 and test on non-face AIGCs (GenImage dataset):

(Results-7) Train on previous dataset (FaceForensics++) and test on DF40:

(Results-8) Train on DF40 and test on other deepfake datasets:

(Analysis-1) Logits and confidence distribution analysis for both fake and real classes:

(Analysis-2) t-SNE visualizations for different models on distinct testing data:

(Analysis-3) Frequency-level artifacts analysis for different synthesis methods:
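
For readers who want to reproduce something like the logit and confidence analysis (Analysis-1), the following sketch shows one common way to turn raw logits into per-class confidence values. It assumes a two-logit classifier where index 0 is "real" and index 1 is "fake"; that layout, the function names, and the example logits are assumptions for illustration, not details confirmed by the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def true_class_confidence(logits_batch, labels):
    """Group the probability assigned to the true class by label.

    Assumption: each item in logits_batch is [real_logit, fake_logit],
    and labels use 0 = real, 1 = fake.
    """
    per_class = {0: [], 1: []}
    for logits, label in zip(logits_batch, labels):
        per_class[label].append(softmax(logits)[label])
    return per_class


def mean(values):
    """Arithmetic mean, or NaN for an empty list."""
    return sum(values) / len(values) if values else float("nan")


# Made-up logits for two real and two fake samples.
logits_batch = [[2.0, 0.0], [0.5, 0.0], [0.0, 2.0], [1.0, 0.0]]
labels = [0, 0, 1, 1]

confidences = true_class_confidence(logits_batch, labels)
print("mean confidence on real samples:", round(mean(confidences[0]), 3))
print("mean confidence on fake samples:", round(mean(confidences[1]), 3))
```

Plotting the two lists in `confidences` as histograms would give a confidence distribution per class; comparing them indicates whether a model is more certain about real faces than about fakes.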

Takeaways & Findings

Here, we very briefly outline several key takeaways to encapsulate our contributions and conclusions:
- 1. The dataset is pivotal in addressing the generalization problem in deepfake detection. To this end, we build a highly diverse deepfake dataset incorporating 40 distinct deepfake techniques, including the most recent ones. Visual examples can be found in the supplementary material of our paper;
- 2. Data domain and forgery method collectively determine the final detection results (see the causal graph in our main paper for intuitive understanding);
- 3. Many recent face-swapping methods (e.g., SimSwap) do NOT involve a blending process but generate all content directly (including the background).
- 4. Blending is not all you need to detect deepfakes, even face-swapping forgeries.
- 5. CLIP-large is the most powerful baseline model due to its notable ability to learn robust real-face distribution.
- 6. A model trained on the face domain (our dataset) can also be applied, to some extent, to non-face (pure AIGC) fakes.
- 7. There is an urgent need to develop an effective incremental learning framework that achieves "1+1>2" results when many different forgeries are combined for training.

Visual Examples

We show visual examples of FS, FR, EFS, and unknown methods (including FE) from our DF40 dataset, illustrated below.

Paper and Supplementary Material

Zhiyuan Yan, Taiping Yao, Shen Chen, Yandan Zhao, Xinghe Fu, Junwei Zhu, Donghao Luo, Chengjie Wang, Shouhong Ding, Yunsheng Wu, Li Yuan.
DF40: Toward Next-Generation Deepfake Detection.
(hosted on ArXiv)

[Bibtex]

Acknowledgements

This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.