Detecting Hands and Recognizing Physical Contact in the Wild
Authors: Supreeth Narasimhaswamy, Trung Nguyen, Minh Hoai Nguyen
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a novel convolutional network based on Mask-RCNN that can jointly learn to localize hands and predict their physical contact to address this problem. To develop and evaluate our method's performance, we introduce a large-scale dataset called ContactHands, containing unconstrained images annotated with hand locations and contact states. This network achieves approximately 7% relative improvement over a baseline network that was built on the vanilla Mask-RCNN architecture and trained for recognizing hand contact states. We train the entire network containing the bounding box regression, mask generation, and contact-estimation branches end-to-end by jointly optimizing a multi-task loss (a sketch of such a combined loss is given after the table). We measure the joint hand detection and contact recognition performance using the VOC average precision metric. We summarize the results of these experiments in Table 2. Ablation Studies: We conduct experiments to study the effect of different components of the Contact Estimation Branch. |
| Researcher Affiliation | Collaboration | 1 Stony Brook University, Stony Brook, NY 11790, USA; 2 VinAI Research, Hanoi, Vietnam. {sunarasimhas, minhhoai}@cs.stonybrook.edu |
| Pseudocode | No | The paper describes methods and processes in text and with diagrams (Figure 1 and Figure 2) but does not include explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and data are available at: https://github.com/cvlab-stonybrook/ContactHands. |
| Open Datasets | Yes | To develop and evaluate our method's performance, we introduce a large-scale dataset called ContactHands, containing unconstrained images annotated with hand locations and contact states. Code and data are available at: https://github.com/cvlab-stonybrook/ContactHands. For still images, we collect images from multiple sources. First, we select images that contain people from popular datasets such as MS COCO [18] and PASCAL VOC [8]. |
| Dataset Splits | Yes | The total number of annotated hand instances is 58,165. We randomly sample 18,877 images from these annotated images to be our training set and 1,629 images to be our test set. There are 52,050 and 5,983 hand instances in the train and test sets, respectively. |
| Hardware Specification | No | No specific hardware details (GPU model, CPU, memory, etc.) are provided for running experiments. |
| Software Dependencies | No | We implement the proposed architecture using Detectron2 [34]. The entire arithmetic operations involved can be implemented as vectorized operations within ten lines of PyTorch code (one possible reading of this is sketched after the table). While Detectron2 and PyTorch are mentioned, specific version numbers for these or other libraries are not provided. |
| Experiment Setup | Yes | We set the number of attention maps L for the spatial attention module to be 32. The weight λ for the contact state loss Lcontact in Eq. (5) is set to 1. The binary cross-entropy losses for all four contact states have equal weights; i.e., we do not scale the losses. The fully-connected layers in the Contact-Estimation branch have dimension 1024. Note that tuning the loss weights for the four states, the parameter L, and the dimensions of the fully-connected layers can likely give better results. We train the network using SGD with an initial learning rate of 0.001 and a batch size of 1. We reduce the learning rate by a factor of 10 when the performance plateaus (an optimizer and scheduler sketch follows the table). |
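
The multi-task objective quoted in the Research Type row combines the usual Mask-RCNN losses with a contact-state term weighted by λ (λ = 1 in the reported setup). Below is a minimal sketch of how such a combined loss could be assembled; the tensor names, shapes, and the exact set of detection losses are illustrative assumptions, not the authors' code.

```python
import torch.nn.functional as F

def total_loss(det_losses, contact_logits, contact_targets, lam=1.0):
    """Sketch of the multi-task objective: detection losses plus a weighted
    contact-state loss (lambda = 1 in the reported setup).

    det_losses      -- dict of Mask-RCNN losses (RPN, classification, box, mask),
                       e.g. as returned by a Detectron2-style detector
    contact_logits  -- (N, 4) logits for the four contact states of each hand box
    contact_targets -- (N, 4) binary labels (a hand can be in several states)
    """
    # Per-state binary cross-entropy, equally weighted across the four states,
    # matching "we do not scale the losses" in the setup description.
    contact_loss = F.binary_cross_entropy_with_logits(
        contact_logits, contact_targets.float()
    )
    return sum(det_losses.values()) + lam * contact_loss
```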
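The Software Dependencies row quotes the paper's remark that the attention arithmetic fits in about ten lines of vectorized PyTorch. One plausible reading is sketched below: L spatial attention maps (L = 32 in the reported setup) are predicted from an ROI feature map, softmax-normalized over spatial locations, and used to pool the features into L weighted descriptors. The layer names, channel count, and pooling layout are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SpatialAttentionPooling(nn.Module):
    """Illustrative spatial-attention pooling over an ROI feature map."""

    def __init__(self, in_channels=256, num_maps=32):
        super().__init__()
        # One 1x1 convolution predicts L attention maps from the ROI features.
        self.attn = nn.Conv2d(in_channels, num_maps, kernel_size=1)

    def forward(self, roi_feats):               # roi_feats: (N, C, H, W)
        a = self.attn(roi_feats)                 # (N, L, H, W)
        a = a.flatten(2).softmax(dim=-1)         # normalize over the H*W locations
        f = roi_feats.flatten(2)                 # (N, C, H*W)
        pooled = torch.einsum('nlp,ncp->nlc', a, f)  # L attention-weighted sums
        return pooled.flatten(1)                 # (N, L*C) input to the FC layers
```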
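The training schedule in the Experiment Setup row (SGD, initial learning rate 0.001, batch size 1, learning rate divided by 10 when performance plateaus) could be expressed in plain PyTorch roughly as follows. The model is a placeholder, the momentum value and plateau patience are assumptions not stated in the paper, and the authors' actual Detectron2 configuration may differ.

```python
import torch

# Placeholder stand-in for the full detector with the Contact-Estimation branch;
# only the optimizer and learning-rate schedule reflect the reported setup.
model = torch.nn.Linear(10, 4)

# Initial learning rate 0.001 as reported; momentum 0.9 is an assumption.
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

# Divide the learning rate by 10 when the monitored metric (e.g. validation AP)
# stops improving; the patience value is illustrative.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1, patience=2
)

for epoch in range(20):
    # ... train for one epoch with batch size 1 ...
    val_ap = 0.0  # placeholder for the measured validation average precision
    scheduler.step(val_ap)
```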