Hand-Model-Aware Sign Language Recognition
Authors: Hezhen Hu, Wengang Zhou, Houqiang Li
AAAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate the effectiveness of our method, we perform extensive experiments on four benchmark datasets, including NMFs-CSL, SLR500, MSASL and WLASL. Experimental results demonstrate that our method achieves state-of-the-art performance on all four popular benchmarks with a notable margin. |
| Researcher Affiliation | Academia | Hezhen Hu (1), Wengang Zhou (1, 2), Houqiang Li (1, 2); (1) CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China; (2) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; alexhu@mail.ustc.edu.cn, {zhwg, lihq}@ustc.edu.cn |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any concrete access to source code or a statement about its release. |
| Open Datasets | Yes | We evaluate our proposed method on four publicly available datasets, including NMFs-CSL (Hu et al. 2020), SLR500 (Huang et al. 2019), MSASL (Joze and Koller 2019) and WLASL (Li et al. 2020b). |
| Dataset Splits | Yes | MSASL is an American sign language dataset (ASL) with a vocabulary size of 1,000. It is collected from Web videos. It contains 25,513 samples in total with 16,054, 5,287 and 4,172 for training, validation and testing, respectively. |
| Hardware Specification | Yes | In our experiment, all the models are implemented in PyTorch (Paszke et al. 2019) platform and trained on NVIDIA RTX-TITAN. |
| Software Dependencies | No | In our experiment, all the models are implemented in PyTorch (Paszke et al. 2019) platform and trained on NVIDIA RTX-TITAN. ... We use OpenPose (Cao et al. 2019; Simon et al. 2017) to extract the full keypoints... |
| Experiment Setup | Yes | Temporally, we extract 32 frames using random and center sampling during training and testing, respectively. During training, the input frames are randomly cropped to 256×256 at the same spatial position. Then the frames are randomly horizontally flipped with a probability of 0.5. During testing, the input video is center cropped to 256×256 and fed into the model. The model is trained with Stochastic Gradient Descent (SGD) optimizer. The weight decay and momentum are set to 1e-4 and 0.9, respectively. We set the initial learning rate as 5e-3 and reduce it by a factor of 0.1 when the validation loss is saturated. In all experiments, the hyperparameters ϵ, w_β, λ_spa, λ_tem, λ_reg, α_0, α_1 and α_2 are set to 0.4, 10, 0.1, 0.1, 0.1, 1, 2.5 and 4, respectively. |
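
The preprocessing quoted in the Experiment Setup row is concrete enough to sketch. Below is a minimal PyTorch reading of it, assuming segment-based temporal sampling and clips stored as (T, C, H, W) tensors no smaller than 256×256; the function names and the exact sampling scheme are illustrative assumptions, not the authors' released code.

```python
import random
import torch

def sample_frame_indices(num_video_frames, num_out=32, training=True):
    """Pick 32 frame indices: a random offset inside each uniform segment
    at train time, the segment center at test time (one plausible reading
    of the paper's 'random and center sampling')."""
    segment = num_video_frames / num_out
    if training:
        return [min(int(segment * i + random.uniform(0, segment)),
                    num_video_frames - 1) for i in range(num_out)]
    return [int(segment * i + segment / 2) for i in range(num_out)]

def crop_clip(frames, size=256, training=True):
    """frames: (T, C, H, W) tensor. Crop every frame at the same spatial
    position (random at train, center at test), then flip with p = 0.5
    at train time only."""
    _, _, h, w = frames.shape
    if training:
        top = random.randint(0, h - size)
        left = random.randint(0, w - size)
    else:
        top, left = (h - size) // 2, (w - size) // 2
    clip = frames[:, :, top:top + size, left:left + size]
    if training and random.random() < 0.5:
        clip = torch.flip(clip, dims=[3])  # flip along width
    return clip
```

Note that a single (top, left) offset is applied to all frames, which matches the paper's requirement that frames are cropped "at the same spatial position" so the clip stays spatially consistent over time.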
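The optimization recipe (SGD with weight decay 1e-4 and momentum 0.9, initial learning rate 5e-3, reduced by a factor of 0.1 when validation loss saturates) likewise maps onto standard PyTorch components. A minimal sketch, assuming ReduceLROnPlateau as the saturation rule (the paper does not name the scheduler) and using a placeholder network and dummy data in place of the real training loop:

```python
import torch
import torch.nn as nn
from torch.optim.lr_scheduler import ReduceLROnPlateau

# Placeholder standing in for the hand-model-aware network; 1,000-way
# output matches the MSASL/WLASL vocabulary size quoted above.
model = nn.Linear(512, 1000)

optimizer = torch.optim.SGD(model.parameters(), lr=5e-3,
                            momentum=0.9, weight_decay=1e-4)
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.1, patience=5)

# Dummy batch; a real run would iterate over the sampled, cropped clips.
x, y = torch.randn(8, 512), torch.randint(0, 1000, (8,))
for epoch in range(60):  # epoch count is an assumption, not from the paper
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())  # stand-in for the true validation loss
```

The patience value is an assumption; the paper only states that the rate drops by 0.1 once the validation loss stops improving.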