TUSK: Task-Agnostic Unsupervised Keypoints

Authors: Yuhe Jin, Weiwei Sun, Jan Hosang, Eduard Trulls, Kwang Moo Yi

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show experiments on multiple-instance detection and classification, object discovery, and landmark detection, all unsupervised, with performance on par with the state of the art, while also being able to deal with multiple instances. From Section 4 (Experiments): We apply our method to three tasks: multiple-instance object detection, object discovery, and landmark detection. We use five different datasets.
Researcher Affiliation | Collaboration | Yuhe Jin¹, Weiwei Sun¹, Jan Hosang², Eduard Trulls², Kwang Moo Yi¹; ¹The University of British Columbia, ²Google Research
Pseudocode | No | The paper describes the proposed framework and methods through text and diagrams (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | We will release the code once the paper is accepted.
Open Datasets | Yes | MNIST-Hard [1] contains synthetically generated images composed of multiple MNIST digits. CLEVR [27] contains visual scenes with a variable number of objects in each scene. Tetrominoes [27] contains 80K images of tetrominoes, geometric shapes composed of four squares. CelebA [35] contains 200K images of human faces. Human3.6M (H36M) [22] contains 3.6M captured human images with ground-truth joint locations from 11 actors (7 for training, 4 for test) and 17 activities.
Dataset Splits | No | For MNIST-Hard, 'We generate 50K such images for training and testing, respectively.' For CLEVR, 'We train our model with the first 60K images and evaluate with the last 10K.' For Tetrominoes, 'We train our model using the first 50K images and evaluate with the last 10K.' For CelebA, 'using all images except for the ones in the MAFL (Multi-Attribute Facial Landmark) test set, and train a linear regressor... on the MAFL training set.' For H36M, 'using 6 actors in the training set for training and the last one for evaluation.' While training and testing splits are provided, explicit *validation* splits are not mentioned for any dataset.
Hardware Specification | No | The paper states, 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We report this in the Supplementary Material.' However, the provided text does not contain these details, and no specific hardware models are mentioned in the main body.
Software Dependencies | No | The paper mentions software components such as 'VGG16 network' and 'U-Net', but does not provide specific version numbers for any software dependencies or libraries required for replication.
Experiment Setup | Yes | We found C=32 to be sufficient for all datasets and tasks. We use either the mean squared error (MSE) or the perceptual loss [24]. Specifically, we train only the encoder and the decoder for one step via λ_recon L_recon + λ_km L_km + λ_eqv L_eqv (see Supplementary Material for details). We then train the prototypes and their GMM for eight steps using L_sw. We repeat this iterative training process until convergence. We thus extract K=9 keypoints and M=10 prototypes to represent them. We use K=6 keypoints and M=48 prototypes for CLEVR and K=3 and M=114 for Tetrominoes. We extract K=4 keypoints with M=32 prototypes for CelebA and K=16 keypoints with M=32 prototypes for H36M.
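
The Experiment Setup row is the most directly actionable part of the table, so a minimal PyTorch sketch of the reported schedule is given below: one optimiser step on the encoder and decoder with the combined loss λ_recon L_recon + λ_km L_km + λ_eqv L_eqv, then eight steps on the prototypes and their GMM with L_sw, repeated until convergence. Only the schedule, C=32, and the per-dataset (K, M) counts come from the paper; everything else (the encoder/decoder, prototype bank, individual loss implementations, loss weights, learning rates, batch and image sizes) is a placeholder assumption, since the code was not released per the Open Source Code row.

```python
import torch
import torch.nn as nn

# Per-dataset keypoint (K) / prototype (M) counts reported in the paper;
# C=32 feature channels is stated to suffice for all datasets and tasks.
KM_PER_DATASET = {
    "mnist_hard":  (9, 10),
    "clevr":       (6, 48),
    "tetrominoes": (3, 114),
    "celeba":      (4, 32),
    "h36m":        (16, 32),
}
C = 32
K, M = KM_PER_DATASET["mnist_hard"]   # K keypoints come from an (omitted) keypoint head; only M is used below
lam_recon, lam_km, lam_eqv = 1.0, 1.0, 1.0   # assumed weights; the paper defers exact values to the supplementary

# Placeholder modules: a toy encoder/decoder pair and a learnable prototype bank.
encoder = nn.Sequential(nn.Conv2d(1, C, 3, padding=1), nn.ReLU())
decoder = nn.Conv2d(C, 1, 3, padding=1)
prototypes = nn.Parameter(torch.randn(M, C))

opt_net = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
opt_proto = torch.optim.Adam([prototypes], lr=1e-4)

def combined_loss(img):
    """lam_recon*L_recon + lam_km*L_km + lam_eqv*L_eqv, with toy stand-ins for each term."""
    feat = encoder(img)
    l_recon = nn.functional.mse_loss(decoder(feat), img)   # MSE variant (the paper also uses a perceptual loss)
    pts = feat.flatten(2).transpose(1, 2).reshape(-1, C)
    l_km = torch.cdist(pts, prototypes.detach()).min(dim=1).values.mean()  # k-means-style attraction to prototypes
    l_eqv = torch.zeros(())                                 # equivariance term omitted in this sketch
    return lam_recon * l_recon + lam_km * l_km + lam_eqv * l_eqv

def sliced_wasserstein_loss(img, n_proj=32):
    """Toy sliced-Wasserstein distance between encoder features and the prototype bank."""
    with torch.no_grad():                                   # this step updates the prototypes only
        pts = encoder(img).flatten(2).transpose(1, 2).reshape(-1, C)
    pts = pts[torch.randperm(pts.shape[0])[:M]]             # subsample to M feature points (illustration only)
    dirs = torch.randn(n_proj, C)
    dirs = dirs / dirs.norm(dim=1, keepdim=True)
    p_feat = torch.sort(pts @ dirs.T, dim=0).values         # sorted 1-D projections of features
    p_proto = torch.sort(prototypes @ dirs.T, dim=0).values # sorted 1-D projections of prototypes
    return (p_feat - p_proto).pow(2).mean()

# Iterative schedule from the paper: 1 encoder/decoder step, then 8 prototype/GMM steps.
for img in (torch.rand(8, 1, 64, 64) for _ in range(100)):  # stand-in data loader
    opt_net.zero_grad()
    combined_loss(img).backward()
    opt_net.step()
    for _ in range(8):
        opt_proto.zero_grad()
        sliced_wasserstein_loss(img).backward()
        opt_proto.step()
```

The point of the sketch is the 1-to-8 update ratio and the separation of the two optimisers; those are the parts of the setup that the Experiment Setup row pins down exactly, while the loss internals would have to be reconstructed from the paper and its supplementary material.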