Deep Structured Prediction for Facial Landmark Detection
Authors: Lisha Chen, Hui Su, Qiang Ji
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our methods on popular benchmark facial landmark detection datasets, including 300W [35], Menpo [49], COFW [6], and 300VW [1]. Evaluation metrics: We evaluate our algorithm using the standard normalized mean error (NME) and the cumulative error distribution (CED) curve. In addition, the area under the curve (AUC) and the failure rate (FR) for a maximum error of 0.07 are reported. Implementation details: To make a fair comparison with the SoA purely deep-learning-based methods [5], we use the same training and testing procedure for 2D landmark detection. The 3D deformable model was trained on the 300W-train or 300W-LP dataset by structure from motion [4]. For the CNN, we use 4 stacks of Hourglass with the same structure as [5], each stack followed by a softmax layer that outputs a probability map for each facial landmark. From the probability map, we compute the mean µᵢ and covariance Σᵢ. We additionally use a softmax cross-entropy loss and an L1 loss on the mean [38] to assist training, which empirically gives better performance. Training procedure: The initial learning rate η₁ is 10⁻⁴ for 15 epochs with a minibatch of 10, then dropped to 10⁻⁵ and 10⁻⁶ after every further 15 epochs, and training continues until convergence. The learning rate η₂ is set to 10⁻³. We applied random augmentations such as random cropping, rotation, etc. We first train the method on the 300W-LP [54] dataset, which is augmented from the original 300W dataset for large yaw poses, and then fine-tune on the original 300W train set. Testing procedure: We follow the same testing procedure as [5]. The face is cropped using the ground-truth bounding box defined in 300W. The cropped face is rescaled to 256×256 before being passed to the network. For the Menpo-profile dataset the annotation scheme is different, so we use the 26 overlapping points for evaluation, i.e., removing points other than the 2 endpoints on the face contour and the eyebrow, respectively, and removing the 5th point on the nose contour. 4.1 Comparison with existing approaches: In Table 1, we compare with the best recently reported results under the 300W protocol, which trains on LFPW-train, HELEN-train, and AFW, tests on LFPW-test, HELEN-test, and ibug, and uses NME normalized by the inter-ocular/pupil distance as the metric. In Table 2, we compare with other baseline facial landmark detection algorithms, including purely deep-learning-based methods such as TCDCN [50] and FAN [5] as well as hybrid methods such as CLNF [2] and CE-CLM [48]. The results for these methods are evaluated using the code provided by the authors under the same experimental protocol, i.e., the same bounding box and the same evaluation metrics. The CED curves on the 300W test set are shown in Fig. 3a. Cross-dataset evaluation: Besides the 300W test set, we evaluate the proposed method on the Menpo dataset, the COFW-68 test set, and the 300VW test set for cross-dataset evaluation. The results are shown in Table 2 for the Menpo and COFW-68 datasets and in Table 3 for the 300VW dataset; the CED curves are shown in Figs. 3b, 3c, and 3d, respectively. The method is trained on 300W-LP and fine-tuned on the 300W Challenge train set for 68 landmarks. Compared to the results on the 300W test set and the Menpo-frontal dataset, where SoA methods attain saturating performance as mentioned in [5], for cross-dataset evaluation under more challenging conditions, such as COFW with heavy occlusion and Menpo-profile with large pose, the proposed method shows better generalization with a significant performance improvement. The proposed method also shows the smallest failure rate (FR) on all evaluated datasets. 4.2 Analysis: In this section, we report the results of a sensitivity analysis and an ablation study. (Hedged sketches of the heatmap moment computation and the NME/FR metrics follow the table.) |
| Researcher Affiliation | Collaboration | Lisha Chen1, Hui Su1,2, Qiang Ji1 1Rensselaer Polytechnic Institute, 2IBM Research |
| Pseudocode | Yes | Algorithm 1: Learning CNN-CRF. Input: training data {(xₘ, yₘ), m = 1, …, M}; Initialization: parameters Θ⁰ = {θ₁⁰ = randn, C⁰ᵢⱼ = 0}, t = 0; while not converged do ... Algorithm 2: Inference for CNN-CRF. Input: face image x; Initialization: y⁰ᵢ = µᵢ, i = 1, …, N, t = 0; while not converged do ... |
| Open Source Code | No | The paper does not provide an explicit statement or link to the open-source code for the methodology described. |
| Open Datasets | Yes | We evaluate our methods on popular benchmark facial landmark detection datasets, including 300W [35], Menpo [49], COFW [6], and 300VW [1]. 300W has 68-landmark annotations. It contains 3837 faces for training and 300 indoor plus 300 outdoor faces for testing. Menpo contains images from AFLW and FDDB with landmarks re-annotated following the 68-landmark scheme. It has two subsets: Menpo-frontal, which has 68-landmark annotations for near-frontal faces (6679 samples), and Menpo-profile, which has 39-landmark annotations for profile faces (2300 samples). We use it as a test set for cross-dataset evaluation. COFW has 1345 training samples and 507 testing samples, whose facial images are all partially occluded. The original dataset is annotated with 29 landmarks. We use the COFW-68 test set [19], which has a 68-landmark re-annotation, for cross-dataset evaluation. 300VW is a facial video dataset with 68-landmark annotations. It contains 3 scenarios: 1) constrained laboratory and naturalistic well-lit conditions; 2) unconstrained real-world conditions with different illuminations, dark rooms, overexposed shots, etc.; 3) completely unconstrained arbitrary conditions including various illumination, occlusions, make-up, expression, head pose, etc. We use the test set for cross-dataset evaluation. |
| Dataset Splits | No | The paper mentions training and testing splits for datasets like 300W (3837 faces for training and 600 for testing), but it does not specify a separate validation split with numerical details or a clear description of how it's used. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments (e.g., GPU models, CPU types, or cloud resources). |
| Software Dependencies | No | The paper mentions using CNNs and CRFs, and refers to concepts like Hourglass networks and specific loss functions (softmax cross entropy loss and L1 loss), but it does not specify any software libraries (e.g., PyTorch, TensorFlow) or their version numbers used for implementation. |
| Experiment Setup | Yes | Implementation details. ... For the CNN, we use 4 stacks of Hourglass with the same structure as [5], each stack followed by a softmax layer that outputs a probability map for each facial landmark. From the probability map, we compute the mean µᵢ and covariance Σᵢ. We additionally use a softmax cross-entropy loss and an L1 loss on the mean [38] to assist training, which empirically gives better performance. Training procedure: The initial learning rate η₁ is 10⁻⁴ for 15 epochs with a minibatch of 10, then dropped to 10⁻⁵ and 10⁻⁶ after every further 15 epochs, and training continues until convergence (a sketch of this step schedule follows the table). The learning rate η₂ is set to 10⁻³. We applied random augmentations such as random cropping, rotation, etc. |
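
The Experiment Setup row states that each Hourglass stack's softmax layer yields a per-landmark probability map from which a mean µᵢ and covariance Σᵢ are computed. A minimal sketch of that moment computation, assuming a single (H, W) heatmap that sums to 1; the function name and the (x, y) coordinate convention are illustrative, not taken from the paper:

```python
import numpy as np

def heatmap_moments(prob):
    """Mean and covariance of one landmark's softmax probability map.

    prob: (H, W) non-negative array summing to 1 (the per-landmark
    output of the softmax layer that follows each Hourglass stack).
    Returns mu (2,) in (x, y) pixel coordinates and Sigma (2, 2).
    """
    h, w = prob.shape
    ys, xs = np.mgrid[0:h, 0:w]                    # row (y) and column (x) grids
    coords = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float64)
    p = prob.ravel()
    mu = p @ coords                                # E[u], probability-weighted mean
    centered = coords - mu
    sigma = (centered * p[:, None]).T @ centered   # E[(u - mu)(u - mu)^T]
    return mu, sigma
```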
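The quoted training procedure (η₁ = 10⁻⁴ for 15 epochs, then divided by 10 after each further 15-epoch block) maps to a simple step schedule. A hedged sketch; the 10⁻⁶ floor is an assumption reflecting the "training continues until convergence" phrasing, since the paper does not say what happens after the second drop:

```python
def eta1(epoch):
    """Step schedule for the CNN learning rate eta_1 quoted above:
    1e-4 for epochs 0-14, 1e-5 for epochs 15-29, and 1e-6 from epoch 30
    onward (floored there while training runs to convergence)."""
    return max(1e-4 * 0.1 ** (epoch // 15), 1e-6)

# eta1(0) -> 1e-4, eta1(15) -> 1e-5, eta1(45) -> 1e-6 (floored)
```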
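The Research Type row quotes the evaluation protocol: NME normalized by inter-ocular/pupil distance, plus the failure rate for a 0.07 cutoff. A minimal sketch for the 68-point scheme; the outer eye-corner indices (36, 45) are the conventional iBUG-68 choice and an assumption here, as the quoted text does not name the normalizing points:

```python
import numpy as np

def nme(pred, gt, left_idx=36, right_idx=45):
    """Mean point-to-point error normalized by inter-ocular distance.

    pred, gt: (68, 2) arrays of predicted / ground-truth landmarks for
    one face. left_idx/right_idx: outer eye corners in the iBUG-68
    scheme (assumed; swap in pupil centers for inter-pupil NME).
    """
    inter_ocular = np.linalg.norm(gt[left_idx] - gt[right_idx])
    return np.linalg.norm(pred - gt, axis=1).mean() / inter_ocular

def failure_rate(nmes, thresh=0.07):
    """Fraction of test faces whose NME exceeds the 0.07 cutoff (the FR
    reported alongside AUC in the quoted protocol)."""
    return float((np.asarray(nmes) > thresh).mean())
```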