AutoLink: Self-supervised Learning of Human Skeletons and Object Outlines by Linking Keypoints
Authors: Xingzhe He, Bastian Wandt, Helge Rhodin
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate on 4 benchmarks that the trained detector has significantly improved keypoint localization accuracy, and on 6 additional datasets that it applies to a broader set of images spanning portraits, persons, animals, hands, and flowers, which we attribute to the explicit modeling of links in the graph. Figure 1 shows the diverse set of image domains it applies to, including challenging textures and uncontrolled backgrounds, how both skeleton representations and object outlines are learned by varying the number of keypoints, and exemplifies applications to controlled image generation. [Section 4, Experiments:] In this section, we compare our results to related methods, showing that our model is simple yet effective. In addition, we perform a number of ablation studies on hyperparameters and algorithm variants, exhibiting the robustness of our model and justifying the necessity of every model component. |
| Researcher Affiliation | Academia | Xingzhe He, Bastian Wandt, Helge Rhodin; University of British Columbia; {xingzhe, wandt, rhodin}@cs.ubc.ca |
| Pseudocode | No | The paper describes its method in prose and through mathematical equations, but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Project website: https://xingzhehe.github.io/autolink/. (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] The code, instructions, and the pre-trained models are released on our GitHub. |
| Open Datasets | Yes | CelebA-aligned [63] contains 200k celebrity faces aligned in the center. Human3.6M [39] contains human activity videos in static backgrounds. DeepFashion [64] contains 53k in-shop clothes images. CUB-200-2011 [103] consists of 11,788 images of birds. Flower [77], 11k Hands [1], Horses [130], and Zebras [130] are used for qualitative experiments. VoxCeleb2 [16] and AFHQ [14] are used for pose transfer and conditional image generation, respectively. |
| Dataset Splits | Yes | We follow [102] in splitting it into three subsets: the CelebA training set without MAFL (160k images), the MAFL training set (19k), and the MAFL test set (1k). [For CelebA-in-the-wild, we] first split it into three subsets as for CelebA-aligned, and then remove the images where a face covers less than 30% of the area, which results in 45,609 images for model training, 5,379 with keypoint labels for regression, and 283 for testing. Human3.6M [39] contains human activity videos in static backgrounds. We follow [126], considering six activities (direction, discussion, posing, waiting, greeting, walking) and using subjects 1, 5, 6, 7, 8, 9 for training and 11 for testing. This results in 796,648 images for training and 87,975 images for testing. (A minimal split sketch follows the table.) |
| Hardware Specification | Yes | It takes 3 hours to train on a single V100 GPU. |
| Software Dependencies | No | The paper mentions using Adam optimizer, ResNet, and UNet, but does not provide specific version numbers for these or any other software libraries/dependencies. |
| Experiment Setup | Yes | We use the Adam optimizer [54] with a learning rate of 10^-4, with β1 = 0.9, β2 = 0.99. The batch size is 64. We train for 20k iterations. It takes 3 hours to train on a single V100 GPU. All images are resized to 128×128. The learning rate for the edge weights is multiplied by 512 due to the small gradient of SoftPlus [22] when the value is close to 0. (A minimal optimizer sketch follows the table.) |
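
To make the quoted Human3.6M split protocol concrete, the sketch below partitions frame records by activity and subject ID. The record structure and the helper name `split_h36m` are assumptions for illustration, not the authors' released code.

```python
# A minimal sketch of the quoted Human3.6M protocol, assuming frames are
# (subject_id, action, path) tuples; an illustration, not the authors' code.
ACTIONS = {"direction", "discussion", "posing", "waiting", "greeting", "walking"}
TRAIN_SUBJECTS = {1, 5, 6, 7, 8, 9}
TEST_SUBJECT = 11

def split_h36m(frames):
    """Partition frame records into the quoted train/test subsets."""
    kept = [f for f in frames if f[1] in ACTIONS]       # keep the six activities
    train = [f for f in kept if f[0] in TRAIN_SUBJECTS]
    test = [f for f in kept if f[0] == TEST_SUBJECT]
    return train, test
```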
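
The quoted training setup translates naturally into per-parameter-group learning rates. Below is a minimal PyTorch sketch, assuming placeholder names `model` and `edge_weights` (not identifiers from the released repository). The 512× factor compensates for SoftPlus, whose derivative sigmoid(x) tends to 0 exactly when the SoftPlus output is near 0.

```python
import torch

# Minimal sketch of the quoted optimizer setup; `model` and `edge_weights`
# are placeholder names, not identifiers from the AutoLink repository.
model = torch.nn.Conv2d(3, 16, 3)                   # stand-in for the detector/decoder
edge_weights = torch.nn.Parameter(torch.zeros(45))  # raw pre-SoftPlus weights, e.g. 10 keypoints -> 45 edges

# SoftPlus(x) = log(1 + e^x) has derivative sigmoid(x), which vanishes as
# x -> -inf, i.e. precisely when SoftPlus(x) is close to 0; the paper
# compensates by scaling the edge-weight learning rate by 512.
base_lr = 1e-4
optimizer = torch.optim.Adam(
    [
        {"params": model.parameters()},                  # uses base_lr
        {"params": [edge_weights], "lr": base_lr * 512}, # boosted edge-weight lr
    ],
    lr=base_lr,
    betas=(0.9, 0.99),
)
```

Using Adam's parameter groups keeps a single optimizer and a single training loop while applying the boosted learning rate only to the edge weights.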