TUSK: Task-Agnostic Unsupervised Keypoints
Authors: Yuhe Jin, Weiwei Sun, Jan Hosang, Eduard Trulls, Kwang Moo Yi
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show experiments on multiple-instance detection and classification, object discovery, and landmark detection, all unsupervised, with performance on par with the state of the art, while also being able to deal with multiple instances. ... We apply our method to three tasks: multiple-instance object detection, object discovery, and landmark detection. We use five different datasets. |
| Researcher Affiliation | Collaboration | Yuhe Jin¹, Weiwei Sun¹, Jan Hosang², Eduard Trulls², Kwang Moo Yi¹ (¹The University of British Columbia, ²Google Research) |
| Pseudocode | No | The paper describes the proposed framework and methods through text and diagrams (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | We will release the code once the paper is accepted. |
| Open Datasets | Yes | MNIST-Hard [1] contains synthetically-generated images composed of multiple MNIST digits. CLEVR [27] contains visual scenes with a variable number of objects in each scene. Tetrominoes [27] contains 80K images of Tetrominoes, a geometric shape composed of four squares. CelebA [35] contains 200k images of human faces. Human3.6M (H36M) [22] contains 3.6M captured human images with ground truth joint locations from 11 actors (7 for training, 4 for test) and 17 activities. |
| Dataset Splits | No | For MNIST-Hard, 'We generate 50K such images for training and testing, respectively.' For CLEVR, 'We train our model with the first 60K images and evaluate with the last 10K.' For Tetrominoes, 'We train our model using the first 50K images and evaluate with the last 10K.' For CelebA, 'using all images except for the ones in the MAFL (Multi-Attribute Facial Landmark) test set, and train a linear regressor... on the MAFL training set.' For H36M, 'using 6 actors in the training set for training and the last one for evaluation.' While training and testing splits are provided, explicit *validation* splits are not mentioned for any dataset. |
| Hardware Specification | No | The paper states, 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We report this in the Supplementary Material.' However, the provided text does not contain these details, and no specific hardware models are mentioned in the main body. |
| Software Dependencies | No | The paper mentions software components such as 'VGG16 network' and 'U-Net', but does not provide specific version numbers for any software dependencies or libraries required for replication. |
| Experiment Setup | Yes | We found C=32 to be sufficient for all datasets and tasks. We use either the mean squared error (MSE) or the perceptual loss [24]. Specifically, we train only the encoder and the decoder for one step via $\lambda_{\text{recon}} \mathcal{L}_{\text{recon}} + \lambda_{\text{km}} \mathcal{L}_{\text{km}} + \lambda_{\text{eqv}} \mathcal{L}_{\text{eqv}}$ (see Supplementary Material for details). We then train the prototypes and their GMM for eight steps using $\mathcal{L}_{\text{sw}}$. We repeat this iterative training process until convergence. We thus extract K=9 keypoints and M=10 prototypes to represent them. We use K=6 keypoints and M=48 prototypes for CLEVR, and K=3 and M=114 for Tetrominoes. We extract K=4 keypoints with M=32 prototypes for CelebA and K=16 keypoints with M=32 prototypes for H36M. (A sketch of this alternating schedule follows the table.) |
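The alternating schedule quoted in the Experiment Setup row, one encoder/decoder step on the combined loss followed by eight prototype/GMM steps on $\mathcal{L}_{\text{sw}}$, can be summarized in code. The sketch below is a hypothetical reconstruction, not the authors' implementation (which was unreleased at the time of this evaluation): the module and loss names (`encoder`, `decoder`, `prototypes`, `recon_loss`, `km_loss`, `eqv_loss`, `sw_loss`), the loss weights, and the learning rates are all assumptions.

```python
# Hypothetical sketch of TUSK's alternating training schedule, as quoted in
# "Experiment Setup". All names and hyperparameter values are assumptions;
# the authors' code was not released at the time of this evaluation.
import itertools
import torch

# Placeholder weights; the paper defers the exact lambda values to its
# Supplementary Material.
LAMBDA_RECON, LAMBDA_KM, LAMBDA_EQV = 1.0, 1.0, 1.0
PROTO_STEPS = 8  # "train the prototypes and their GMM for eight steps"


def alternating_round(batch, encoder, decoder, prototypes, opt_ae, opt_proto,
                      recon_loss, km_loss, eqv_loss, sw_loss):
    """One round of the iterative schedule: 1 autoencoder step, 8 prototype steps."""
    # (1) One step on encoder + decoder with the combined loss
    #     lambda_recon * L_recon + lambda_km * L_km + lambda_eqv * L_eqv.
    opt_ae.zero_grad()
    loss_ae = (LAMBDA_RECON * recon_loss(batch, encoder, decoder)
               + LAMBDA_KM * km_loss(batch, encoder, prototypes)
               + LAMBDA_EQV * eqv_loss(batch, encoder))
    loss_ae.backward()
    opt_ae.step()

    # (2) Eight steps on the prototypes and their GMM with the
    #     sliced-Wasserstein loss L_sw. Only opt_proto updates parameters,
    #     so the encoder and decoder stay fixed during these steps.
    for _ in range(PROTO_STEPS):
        opt_proto.zero_grad()
        loss_sw = sw_loss(batch, encoder, prototypes)
        loss_sw.backward()
        opt_proto.step()
    return loss_ae.item(), loss_sw.item()


def train(loader, encoder, decoder, prototypes, losses, n_rounds=10_000):
    """Repeat the alternating rounds "until convergence" (here: a fixed budget)."""
    opt_ae = torch.optim.Adam(
        list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
    opt_proto = torch.optim.Adam(prototypes.parameters(), lr=1e-4)
    batches = itertools.cycle(loader)  # loop over the dataset indefinitely
    for _ in range(n_rounds):
        alternating_round(next(batches), encoder, decoder, prototypes,
                          opt_ae, opt_proto, *losses)
```

In practice the keypoint features fed to $\mathcal{L}_{\text{sw}}$ would likely be detached from the encoder's graph during step (2) to avoid accumulating unused gradients; that detail, like the stopping criterion, is an assumption here rather than something the paper specifies.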