Towards Highly Accurate and Stable Face Alignment for High-Resolution Videos

Authors: Ying Tai, Yicong Liang, Xiaoming Liu, Lei Duan, Jilin Li, Chengjie Wang, Feiyue Huang, Yu Chen

AAAI 2019, pp. 8893–8900 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on 300W, 300VW and Talking Face datasets clearly demonstrate that the proposed method is more accurate and stable than the state-of-the-art models. We conduct extensive experiments on both image and video-based alignment datasets, including 300W (Sagonas et al. 2013), 300-VW (Shen et al. 2017) and Talking Face (TF) (FGNET 2014).
Researcher Affiliation | Collaboration | Youtu Lab, Tencent; Michigan State University; Fudan University; Nanjing University of Science and Technology
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | https://github.com/tyshiwo/FHR_alignment
Open Datasets | Yes | We conduct extensive experiments on both image and video-based alignment datasets, including 300W (Sagonas et al. 2013), 300-VW (Shen et al. 2017) and Talking Face (TF) (FGNET 2014).
Dataset Splits | No | The paper specifies training and testing sets, but does not explicitly provide details about a separate validation split with percentages or sample counts.
Hardware Specification | Yes | Training our FHR on 300W takes 7 hours on a P100 GPU.
Software Dependencies | Yes | We train the network with the Torch7 toolbox (Collobert, Kavukcuoglu, and Farabet 2011), using the RMSprop algorithm with an initial learning rate of 2.5 × 10⁻⁴, a minibatch size of 6 and σ = 3.
Experiment Setup | Yes | We train the network with the Torch7 toolbox (Collobert, Kavukcuoglu, and Farabet 2011), using the RMSprop algorithm with an initial learning rate of 2.5 × 10⁻⁴, a minibatch size of 6 and σ = 3. During the stabilization training, we set λ₁ = λ₃ = 1 and λ₂ = 10 to make all terms in the stabilization loss (Eq. 11) on the same order of magnitude. We estimate the average variance ρ of z_i^(t) − p_i^(t) across all training videos and all landmarks, and empirically set the initial value of Γ_noise as ρI. Also, we initialize Γ₁ as the zero matrix O_{2M×2M}, Γ₂ as 10ρI, and γ = β₁ = β₂ = 0.5.
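
For anyone attempting to reproduce this setup, the sketch below collects the quoted hyperparameters in one place. It is a minimal assumption-laden example: the authors used Torch7 (Lua), whereas this uses PyTorch; the stand-in model, the landmark count M, and the variance estimate ρ are placeholders, and only the numeric settings (learning rate, batch size, σ, λ weights, Γ initializations) come from the quoted text.

```python
# Hypothetical PyTorch sketch of the reported training configuration.
# Not the authors' code (which was written for Torch7/Lua); model, M,
# and rho below are placeholders for illustration only.
import torch
import torch.nn as nn

M = 68        # number of facial landmarks (68 assumed, as in 300W)
rho = 0.01    # placeholder for the average variance of z_i^(t) - p_i^(t),
              # which the paper estimates over all training videos/landmarks

model = nn.Conv2d(3, M, kernel_size=3, padding=1)  # stand-in for the FHR network

# RMSprop with the reported initial learning rate; the mini-batch size of 6
# and heatmap Gaussian sigma = 3 would be applied in the data pipeline.
optimizer = torch.optim.RMSprop(model.parameters(), lr=2.5e-4)
batch_size = 6
heatmap_sigma = 3

# Stabilization-loss weights: lambda_1 = lambda_3 = 1, lambda_2 = 10.
lambda1, lambda2, lambda3 = 1.0, 10.0, 1.0

# Covariance initialization for the stabilization term (Eq. 11):
# Gamma_noise = rho * I, Gamma_1 = O_{2M x 2M}, Gamma_2 = 10 * rho * I.
I = torch.eye(2 * M)
Gamma_noise = rho * I
Gamma_1 = torch.zeros(2 * M, 2 * M)
Gamma_2 = 10 * rho * I
gamma = beta1 = beta2 = 0.5
```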