Hand-Model-Aware Sign Language Recognition

Authors: Hezhen Hu, Wengang Zhou, Houqiang Li

AAAI 2021, pp. 1558-1566

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate the effectiveness of our method, we perform extensive experiments on four benchmark datasets, including NMFs-CSL, SLR500, MSASL and WLASL. Experimental results demonstrate that our method achieves state-of-the-art performance on all four popular benchmarks with a notable margin.
Researcher Affiliation | Academia | Hezhen Hu (1), Wengang Zhou (1, 2), Houqiang Li (1, 2); (1) CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China; (2) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; alexhu@mail.ustc.edu.cn, {zhwg, lihq}@ustc.edu.cn
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide access to its source code or any statement about a code release.
Open Datasets | Yes | We evaluate our proposed method on four publicly available datasets, including NMFs-CSL (Hu et al. 2020), SLR500 (Huang et al. 2019), MSASL (Joze and Koller 2019) and WLASL (Li et al. 2020b).
Dataset Splits | Yes | MSASL is an American sign language dataset (ASL) with a vocabulary size of 1,000. It is collected from Web videos. It contains 25,513 samples in total with 16,054, 5,287 and 4,172 for training, validation and testing, respectively.
Hardware Specification | Yes | In our experiment, all the models are implemented in PyTorch (Paszke et al. 2019) platform and trained on NVIDIA RTX-TITAN.
Software Dependencies | No | In our experiment, all the models are implemented in PyTorch (Paszke et al. 2019) platform and trained on NVIDIA RTX-TITAN. ... We use OpenPose (Cao et al. 2019; Simon et al. 2017) to extract the full keypoints...
Experiment Setup | Yes | Temporally, we extract 32 frames using random and center sampling during training and testing, respectively. During training, the input frames are randomly cropped to 256×256 at the same spatial position. Then the frames are randomly horizontally flipped with a probability of 0.5. During testing, the input video is center cropped to 256×256 and fed into the model. The model is trained with the Stochastic Gradient Descent (SGD) optimizer. The weight decay and momentum are set to 1e-4 and 0.9, respectively. We set the initial learning rate as 5e-3 and reduce it by a factor of 0.1 when the validation loss is saturated. In all experiments, the hyperparameters ϵ, w_β, λ_spa, λ_tem, λ_reg, α_0, α_1 and α_2 are set to 0.4, 10, 0.1, 0.1, 0.1, 1, 2.5 and 4, respectively.
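
The quoted experiment setup maps onto standard PyTorch and torchvision components. The sketch below is a minimal illustration under that assumption, not the authors' code (none is released): the backbone model is a hypothetical placeholder, and only the values quoted above (256×256 random/center crops, horizontal flip with p = 0.5, SGD with learning rate 5e-3, momentum 0.9, weight decay 1e-4, and a 0.1 reduction of the learning rate when the validation loss saturates) come from the paper.

# A minimal sketch of the quoted training configuration, assuming standard
# PyTorch / torchvision APIs. The backbone below is a hypothetical stand-in;
# the paper's hand-model-aware network is not publicly released.
import torch.nn as nn
from torch.optim import SGD
from torch.optim.lr_scheduler import ReduceLROnPlateau
from torchvision import transforms

# Spatial pipeline: random 256x256 crop and horizontal flip (p = 0.5) during
# training; a center 256x256 crop during testing.
train_transform = transforms.Compose([
    transforms.RandomCrop(256),
    transforms.RandomHorizontalFlip(p=0.5),
])
test_transform = transforms.CenterCrop(256)

# Placeholder backbone; the 1,000-way output matches the MSASL vocabulary
# size quoted above.
model = nn.Linear(3 * 256 * 256, 1000)

# SGD with initial learning rate 5e-3, momentum 0.9 and weight decay 1e-4;
# the rate is cut by a factor of 0.1 when the validation loss saturates.
optimizer = SGD(model.parameters(), lr=5e-3, momentum=0.9, weight_decay=1e-4)
scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1)

# Loss weights as quoted: lambda_spa = lambda_tem = lambda_reg = 0.1.
loss_weights = {"spa": 0.1, "tem": 0.1, "reg": 0.1}

# After each validation pass, step the scheduler on the measured loss.
val_loss = 1.0  # placeholder value for illustration
scheduler.step(val_loss)

The 32-frame temporal sampling (random clips for training, center clips for testing) would belong in the data-loading stage and is omitted here, as are the hand-model-aware branches weighted by α_0, α_1 and α_2.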