PrefAce: Face-Centric Pretraining with Self-Structure Aware Distillation
Authors: Siyuan Hu, Zheng Wang, Peng Hu, Xi Peng, Jie Wu, Hongyuan Zhu, Yew Soon Ong
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments demonstrate that the proposed framework learns universal instance-aware facial representations with fine-grained landmark details from videos. Our framework also outperforms the state-of-the-art on various downstream tasks, even in low data regimes. |
| Researcher Affiliation | Academia | Siyuan Hu1*, Zheng Wang2, Peng Hu3, Xi Peng3, Jie Wu2, Hongyuan Zhu4, Yew Soon Ong1,4 — 1Nanyang Technological University, 2Wuhan University, 3Sichuan University, 4Institute for Infocomm Research (I2R) & Centre for Frontier AI Research (CFAR), A*STAR, Singapore |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/siyuan-h/PrefAce. |
| Open Datasets | Yes | For downstream adaptation, we use 28,532 train, 3,567 val, and 3,567 test videos from the CelebV-HQ dataset. Following prior work (Zheng et al. 2022), we report average accuracy and Area Under the Curve (AUC) over all attributes. The Facial Expression Recognition (FER) task encodes spatio-temporal facial muscle movement patterns to predict sentiment (2-class and 7-class) and emotion (6-class) of the concerned subject given a facial video. We evaluate the performance of PrefAce on the CMU-MOSEI dataset, a conversational corpus with 16,726 train, 1,871 val, and 4,662 test samples. Following prior work (Delbrouck et al. 2020), we use overall accuracy as the metric. The Deepfake Detection (DFD) task predicts spatio-temporal facial forgery given a facial video from the FF++(LQ) dataset. For downstream adaptation, we use 3,600 train, 700 val, and 700 test videos from FF++(LQ). Following prior literature (Cai et al. 2022), we use accuracy and AUC as the evaluation metrics. Lip Synchronization (LS) is another task requiring facial-region-specific spatio-temporal synchronization; this downstream adaptation further demonstrates the adaptation capability of PrefAce for face generation tasks. For adaptation, we replace the facial encoder module in Wav2Lip (Prajwal et al. 2020) with PrefAce and adjust the temporal window accordingly, i.e., from 5 frames to T frames. For evaluation, we use the LRS2 dataset with 45,838 train, 1,082 val, and 1,243 test videos. (A hedged accuracy/AUC sketch appears after the table.) |
| Dataset Splits | Yes | For downstream adaptation, we use 28,532 train, 3,567 val, and 3,567 test videos from the CelebV-HQ dataset. We evaluate the performance of PrefAce on the CMU-MOSEI dataset, a conversational corpus with 16,726 train, 1,871 val, and 4,662 test samples. For downstream adaptation, we use 3,600 train, 700 val, and 700 test videos from the FF++(LQ) dataset. For evaluation, we use the LRS2 dataset with 45,838 train, 1,082 val, and 1,243 test videos. |
| Hardware Specification | Yes | The network is trained with PyTorch (Paszke et al. 2019) on Nvidia A100 GPUs. |
| Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al. 2019)' as the training framework but does not specify a precise version number for PyTorch or any other software dependency. |
| Experiment Setup | Yes | The pretraining hyperparameters are as follows: the base learning rate is linearly scaled with respect to the overall batch size, lr = base_learning_rate × batch_size / 256. For self-supervised pretraining, we use the AdamW optimizer with base learning rate 7.5e-4, momentum β1 = 0.9, β2 = 0.999, and a cosine-decay learning rate scheduler. For LP and FT, we use the AdamW optimizer with β1 = 0.9, β2 = 0.98 and base learning rate 1e-4, without weight decay. (A hedged optimizer sketch appears directly after the table.) |
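
To make the Experiment Setup row concrete, here is a minimal PyTorch sketch of the quoted learning-rate scaling rule and optimizer settings. It is not the authors' code: the model, batch size, and epoch count are placeholder assumptions; only the base learning rates, the AdamW betas, and the cosine-decay schedule come from the paper.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Placeholder assumptions -- the encoder, batch size, and epoch count are not
# taken from the paper; only the optimizer settings below are.
model = torch.nn.Linear(768, 768)  # stand-in for the PrefAce encoder
batch_size = 512                   # assumed overall batch size
epochs = 100                       # assumed pretraining length

# Linear scaling rule quoted in the table: lr = base_lr * batch_size / 256.
base_lr = 7.5e-4
lr = base_lr * batch_size / 256

# Self-supervised pretraining: AdamW (beta1=0.9, beta2=0.999) with cosine decay.
optimizer = AdamW(model.parameters(), lr=lr, betas=(0.9, 0.999))
scheduler = CosineAnnealingLR(optimizer, T_max=epochs)

# Linear probing (LP) and fine-tuning (FT): AdamW (beta1=0.9, beta2=0.98),
# base learning rate 1e-4, no weight decay. Whether the scaling rule also
# applies here is not stated, so the base value is used directly.
ft_optimizer = AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.98), weight_decay=0.0)
```

In a real training loop, `scheduler.step()` would be called once per epoch after the optimizer updates, so the learning rate decays along the cosine schedule over the assumed `epochs`.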
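
The Open Datasets row quotes accuracy and AUC as the downstream evaluation metrics. The sketch below shows one common way to compute them with scikit-learn; the labels and scores are random placeholders, and the 700-sample size simply mirrors the FF++(LQ) test split mentioned above.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Placeholder predictions for a binary task such as deepfake detection (DFD);
# in practice these would come from the downstream head on top of the encoder.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=700)   # e.g. the 700 FF++(LQ) test videos
scores = rng.random(700)                # model confidence for the positive class
preds = (scores >= 0.5).astype(int)     # hard predictions at a 0.5 threshold

acc = accuracy_score(labels, preds)     # fraction of correctly classified videos
auc = roc_auc_score(labels, scores)     # threshold-free ranking quality

print(f"accuracy = {acc:.3f}, AUC = {auc:.3f}")
```

For facial attribute recognition on CelebV-HQ, the paper reports these quantities averaged over all attributes, i.e. a per-attribute score is computed and then averaged.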