KPNet: Towards Minimal Face Detector

Authors: Guanglu Song, Yu Liu, Yuhang Zang, Xiaogang Wang, Biao Leng, Qingsheng Yuan

AAAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The above question is decomposed into more detailed sub-problems, and different controlled experiments are performed on FDDB to seek answers. We design detection backbones of various depths with different receptive fields by modifying ResNet50 (He et al. 2016). Four detectors, CNN-50L, CNN-41L, CNN-26L, and CNN-17L, with depths 50, 41, 26, and 17, are evaluated on FDDB. The training set is the same as in (Liu et al. 2017), and the recall of the top 100 proposals per image is used for evaluation. Results are shown in Fig. 1; the anchor setting is A = {[16 2]} for image resolution [200, 400] to detect face scales [16, 128], 2A for image resolution [400, 600] to detect face scales [32, 256], and 4A for image resolution [800, 1000] to detect face scales [64, 512]. We evaluate KPNet on the generic face detection benchmarks FDDB (Jain and Learned-Miller 2010), AFW, and MALF, and on the face alignment benchmark AFLW (Koestinger et al. 2011). We conduct different experiments with DRNet. For coordinate regression, we replace the SS by a fully connected layer to directly regress the keypoint coordinates: global average pooling is applied to the scale proposal S indicated in Sec. to convert it to a feature vector of fixed size, and a fully connected layer with output 2K regresses the keypoint coordinates, where K is the number of keypoints and an L2 loss is used for optimization (a minimal sketch of this regression head follows the table). For argmax, each channel at the specific location S of M corresponds to a specific keypoint, and only the coordinates where keypoints exist are set to 1; the loss function is the same as Eq. 4. In the ablation study, we further use FDDB-90, FDDB90, and PFDDB as additional benchmarks.
Researcher Affiliation | Collaboration | Guanglu Song,1,3 Yu Liu,2 Yuhang Zang,1 Xiaogang Wang,2 Biao Leng,3,4 Qingsheng Yuan5. 1SenseTime X-Lab; 2The Chinese University of Hong Kong, Hong Kong; 3School of Computer Science and Engineering, Beihang University, Beijing 100191, China; 4Beijing Advanced Innovation Center for Big Data and Brain Computing, Beihang University, Beijing 100191; 5National Computer Network Emergency Response Technical Team/Coordination Center of China
Pseudocode | No | The paper describes its methods through text and diagrams (Figure 2, Figure 3), but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about making the source code for the described methodology publicly available, nor does it provide a link to a code repository.
Open Datasets | Yes | For training on generic face detection, we adopt the same training set as (Liu et al. 2017), and no data augmentation is performed. We evaluate KPNet on the generic face detection benchmarks FDDB (Jain and Learned-Miller 2010), AFW, and MALF, and on the face alignment benchmark AFLW (Koestinger et al. 2011). We follow (Feng et al. 2018) in adopting AFLW-Full in our experiments, where 20,000 and 4,386 images are used for training and testing, respectively.
Dataset Splits | No | The paper specifies training and testing splits for AFLW (20,000 training images, 4,386 testing images) and reuses the training set of (Liu et al. 2017) for generic face detection, with evaluation on benchmarks such as FDDB, AFW, and MALF. However, it does not describe a separate validation split, whether as percentages, counts, or a predefined split reference, for any of its experiments.
Hardware Specification | Yes | Benefiting from the low-resolution input and lightweight backbone, we use a batch size of 128 and train the network on 4 GTX 1080Ti GPUs. In offline applications, KPNet with DRNet can achieve 1000 fps on a GTX 1080Ti, faster than other face detectors by a large margin.
Software Dependencies | No | The paper states: "We implement KPNet in PyTorch." However, it does not provide version numbers for PyTorch or any other software library or dependency, which is necessary for reproducibility.
Experiment Setup | Yes | Implementation details: We implement KPNet in PyTorch. Both the hourglass and DRNet are randomly initialized under the default PyTorch settings, without pretraining on any external dataset. During training, we set the input resolution of the network to 256x256, which leads to an output resolution of 128x128. For training on generic face detection, we adopt the same training set as (Liu et al. 2017), and no data augmentation is performed. For joint face detection and alignment, K is set to 5, representing the left eye, right eye, nose, left mouth corner, and right mouth corner. We jointly optimize the loss functions Lscale and Lkeypoint with loss weight 1:1 via SGD. Due to the huge number of pixels in the scalemap M ∈ R^(128x128x60), the weight of Lscale is set to 10000 for faster convergence. Benefiting from the low-resolution input and lightweight backbone, we use a batch size of 128 and train the network on 4 GTX 1080Ti GPUs. We train the network for 150k iterations with a learning rate warmup strategy: the learning rate is linearly increased from 0.00001 to 0.01 over the first 50k iterations, and we reduce it to 0.001 for the last 50k iterations. At the inference stage, we first generate scale proposals from the scalemap using a predefined threshold, then compute the corresponding keypoints via the scale-adaptive soft-argmax according to Eq. 5 and Eq. 6. Finally, NMS with an IoU threshold of 0.6 is applied to the face boxes inferred from these keypoints (sketches of the schedule and inference steps follow the table).
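
As a reading aid for the coordinate-regression baseline quoted in the Research Type row, here is a minimal PyTorch sketch of a global-average-pooling head followed by a fully connected layer with 2K outputs, trained under an L2 loss. The class name, channel count, and feature shape are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class CoordRegressionHead(nn.Module):
    """Hypothetical GAP -> FC(2K) regression head, per the quoted ablation."""

    def __init__(self, in_channels: int, num_keypoints: int = 5):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                    # global average pooling
        self.fc = nn.Linear(in_channels, 2 * num_keypoints)  # one (x, y) pair per keypoint

    def forward(self, scale_proposal: torch.Tensor) -> torch.Tensor:
        # scale_proposal: (N, C, H, W) feature map for one scale proposal
        v = self.gap(scale_proposal).flatten(1)  # fixed-size feature vector, (N, C)
        return self.fc(v)                        # (N, 2K) keypoint coordinates

# L2 (MSE) loss against ground-truth coordinates, as in the ablation
head = CoordRegressionHead(in_channels=256, num_keypoints=5)
feat = torch.randn(8, 256, 16, 16)               # assumed feature shape
target = torch.rand(8, 10)                       # normalized (x, y) pairs, K = 5
loss = nn.functional.mse_loss(head(feat), target)
```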
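
The learning-rate schedule quoted in the Experiment Setup row is concrete enough to sketch. The code below assumes iteration-based stepping with plain SGD and holds the rate at 0.01 between the 50k warmup and the final 50k, which is one reading of the quoted wording; the helper name is ours.

```python
import torch

def kpnet_lr(step: int) -> float:
    """One reading of the quoted schedule: linear warmup from 1e-5 to 0.01
    over the first 50k iterations, 0.01 until 100k (assumed), then 0.001
    for the last 50k of the 150k total."""
    if step < 50_000:
        return 1e-5 + (step / 50_000) * (0.01 - 1e-5)
    if step < 100_000:
        return 0.01
    return 0.001

# Hooked into SGD via LambdaLR, which scales the base lr (0.01) by the lambda
model = torch.nn.Conv2d(3, 64, 3)  # stand-in module, not KPNet
opt = torch.optim.SGD(model.parameters(), lr=0.01)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda=lambda s: kpnet_lr(s) / 0.01)
# During training, call opt.step() then sched.step() once per iteration.
```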
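
The inference pipeline (threshold the scalemap, decode keypoints, NMS at IoU 0.6) can likewise be sketched. Note that this uses a plain spatial soft-argmax and torchvision's stock NMS as generic stand-ins, not the paper's scale-adaptive soft-argmax from Eq. 5 and Eq. 6; all tensors and names are illustrative.

```python
import torch
from torchvision.ops import nms

def soft_argmax_2d(heatmaps: torch.Tensor, beta: float = 10.0) -> torch.Tensor:
    """Plain spatial soft-argmax: expected (x, y) under a softmax per map.

    heatmaps: (N, K, H, W) -> (N, K, 2). A generic stand-in for the
    scale-adaptive soft-argmax of Eq. 5 and Eq. 6, not a reproduction.
    """
    n, k, h, w = heatmaps.shape
    probs = torch.softmax(beta * heatmaps.reshape(n, k, -1), dim=-1).reshape(n, k, h, w)
    xs = torch.arange(w, dtype=probs.dtype)
    ys = torch.arange(h, dtype=probs.dtype)
    x = (probs.sum(dim=2) * xs).sum(dim=-1)  # x-marginal (sum over rows), then expectation
    y = (probs.sum(dim=3) * ys).sum(dim=-1)  # y-marginal (sum over columns), then expectation
    return torch.stack([x, y], dim=-1)

# Final step of the quoted pipeline: NMS at IoU 0.6 on boxes inferred from keypoints
boxes = torch.tensor([[10.0, 10.0, 60.0, 60.0], [12.0, 12.0, 62.0, 62.0]])
scores = torch.tensor([0.9, 0.8])
keep = nms(boxes, scores, iou_threshold=0.6)  # indices of kept boxes
```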