TUSK: Task-Agnostic Unsupervised Keypoints

Authors: Yuhe Jin, Weiwei Sun, Jan Hosang, Eduard Trulls, Kwang Moo Yi

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show experiments on multiple-instance detection and classification, object discovery, and landmark detection, all unsupervised, with performance on par with the state of the art, while also being able to deal with multiple instances. From Section 4 (Experiments): We apply our method to three tasks: multiple-instance object detection, object discovery, and landmark detection. We use five different datasets.
Researcher Affiliation | Collaboration | Yuhe Jin¹, Weiwei Sun¹, Jan Hosang², Eduard Trulls², Kwang Moo Yi¹; ¹The University of British Columbia, ²Google Research
Pseudocode | No | The paper describes the proposed framework and methods through text and diagrams (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | We will release the code once the paper is accepted.
Open Datasets | Yes | MNIST-Hard [1] contains synthetically generated images composed of multiple MNIST digits. CLEVR [27] contains visual scenes with a variable number of objects in each scene. Tetrominoes [27] contains 80K images of tetrominoes, geometric shapes composed of four squares. CelebA [35] contains 200K images of human faces. Human3.6M (H36M) [22] contains 3.6M captured human images with ground-truth joint locations from 11 actors (7 for training, 4 for test) and 17 activities.
Dataset Splits | No | For MNIST-Hard, 'We generate 50K such images for training and testing, respectively.' For CLEVR, 'We train our model with the first 60K images and evaluate with the last 10K.' For Tetrominoes, 'We train our model using the first 50K images and evaluate with the last 10K.' For CelebA, 'using all images except for the ones in the MAFL (Multi-Attribute Facial Landmark) test set, and train a linear regressor... on the MAFL training set.' For H36M, 'using 6 actors in the training set for training and the last one for evaluation.' While training and testing splits are provided, explicit *validation* splits are not mentioned for any dataset.
Hardware Specification | No | The paper states, 'Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] We report this in the Supplementary Material.' However, the provided text does not contain these details, and no specific hardware models are mentioned in the main body.
Software Dependencies | No | The paper mentions software components such as 'VGG16 network' and 'U-Net', but does not provide specific version numbers for any software dependencies or libraries required for replication.
Experiment Setup | Yes | We found C=32 to be sufficient for all datasets and tasks. We use either the mean squared error (MSE) or the perceptual loss [24]. Specifically, we train only the encoder and the decoder for one step via λ_recon L_recon + λ_km L_km + λ_eqv L_eqv (see Supplementary Material for details). We then train the prototypes and their GMM for eight steps using L_sw. We repeat this iterative training process until convergence. We thus extract K=9 keypoints and M=10 prototypes to represent them. We use K=6 keypoints and M=48 prototypes for CLEVR and K=3 and M=114 for Tetrominoes. We extract K=4 keypoints with M=32 prototypes for CelebA and K=16 keypoints with M=32 prototypes for H36M.
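
The Experiment Setup row is the most directly actionable part of the table, so a minimal PyTorch sketch of the reported schedule is given below: one optimiser step on the encoder and decoder with the combined loss λ_recon L_recon + λ_km L_km + λ_eqv L_eqv, then eight steps on the prototypes and their GMM with L_sw, repeated until convergence. Only the schedule, C=32, and the per-dataset (K, M) counts come from the paper; everything else (the encoder/decoder, prototype bank, individual loss implementations, loss weights, learning rates, batch and image sizes) is a placeholder assumption, since the code was not released per the Open Source Code row.

```python
import torch
import torch.nn as nn

# Per-dataset keypoint (K) / prototype (M) counts reported in the paper;
# C=32 feature channels is stated to suffice for all datasets and tasks.
KM_PER_DATASET = {
    "mnist_hard":  (9, 10),
    "clevr":       (6, 48),
    "tetrominoes": (3, 114),
    "celeba":      (4, 32),
    "h36m":        (16, 32),
}
C = 32
K, M = KM_PER_DATASET["mnist_hard"]   # K keypoints come from an (omitted) keypoint head; only M is used below
lam_recon, lam_km, lam_eqv = 1.0, 1.0, 1.0   # assumed weights; the paper defers exact values to the supplementary

# Placeholder modules: a toy encoder/decoder pair and a learnable prototype bank.
encoder = nn.Sequential(nn.Conv2d(1, C, 3, padding=1), nn.ReLU())
decoder = nn.Conv2d(C, 1, 3, padding=1)
prototypes = nn.Parameter(torch.randn(M, C))

opt_net = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
opt_proto = torch.optim.Adam([prototypes], lr=1e-4)

def combined_loss(img):
    """lam_recon*L_recon + lam_km*L_km + lam_eqv*L_eqv, with toy stand-ins for each term."""
    feat = encoder(img)
    l_recon = nn.functional.mse_loss(decoder(feat), img)   # MSE variant (the paper also uses a perceptual loss)
    pts = feat.flatten(2).transpose(1, 2).reshape(-1, C)
    l_km = torch.cdist(pts, prototypes.detach()).min(dim=1).values.mean()  # k-means-style attraction to prototypes
    l_eqv = torch.zeros(())                                 # equivariance term omitted in this sketch
    return lam_recon * l_recon + lam_km * l_km + lam_eqv * l_eqv

def sliced_wasserstein_loss(img, n_proj=32):
    """Toy sliced-Wasserstein distance between encoder features and the prototype bank."""
    with torch.no_grad():                                   # this step updates the prototypes only
        pts = encoder(img).flatten(2).transpose(1, 2).reshape(-1, C)
    pts = pts[torch.randperm(pts.shape[0])[:M]]             # subsample to M feature points (illustration only)
    dirs = torch.randn(n_proj, C)
    dirs = dirs / dirs.norm(dim=1, keepdim=True)
    p_feat = torch.sort(pts @ dirs.T, dim=0).values         # sorted 1-D projections of features
    p_proto = torch.sort(prototypes @ dirs.T, dim=0).values # sorted 1-D projections of prototypes
    return (p_feat - p_proto).pow(2).mean()

# Iterative schedule from the paper: 1 encoder/decoder step, then 8 prototype/GMM steps.
for img in (torch.rand(8, 1, 64, 64) for _ in range(100)):  # stand-in data loader
    opt_net.zero_grad()
    combined_loss(img).backward()
    opt_net.step()
    for _ in range(8):
        opt_proto.zero_grad()
        sliced_wasserstein_loss(img).backward()
        opt_proto.step()
```

The point of the sketch is the 1-to-8 update ratio and the separation of the two optimisers; those are the parts of the setup that the Experiment Setup row pins down exactly, while the loss internals would have to be reconstructed from the paper and its supplementary material.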