BEST: BERT Pre-training for Sign Language Recognition with Coupling Tokenization

Authors: Weichao Zhao, Hezhen Hu, Wengang Zhou, Jiaxin Shi, Houqiang Li

AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments are conducted to validate the effectiveness of our proposed method, achieving new state-of-the-art performance on all four benchmarks with a notable gain.
Researcher Affiliation | Collaboration | Weichao Zhao1, Hezhen Hu1*, Wengang Zhou1,2, Jiaxin Shi3, Houqiang Li1,2. 1 CAS Key Laboratory of GIPAS, EEIS Department, University of Science and Technology of China (USTC); 2 Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; 3 Huawei Inc.
Pseudocode | No | The paper describes the system architecture and processes but does not include any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement about open-source code availability or a link to a code repository.
Open Datasets | Yes | We conduct experiments on four public sign language datasets, i.e., NMFs-CSL (Hu et al. 2021b), SLR500 (Huang et al. 2018), WLASL (Li et al. 2020a) and MSASL (Joze and Koller 2018). The training sets of all datasets participate in the pre-training stage. Table 1 presents an overview of the above-mentioned datasets.
Dataset Splits | Yes | NMFs-CSL is a large-scale Chinese sign language (CSL) dataset with a vocabulary size of 1,067. All samples are split into 25,608 and 6,402 samples for training and testing, respectively. SLR500 is another CSL dataset including 500 daily words performed by 50 signers. It contains a total of 125,000 samples, of which 90,000 and 35,000 samples are utilized for training and testing, respectively. ... Following (Hu et al. 2021a), we temporally select 32 frames using random and center temporal sampling during training and testing, respectively. (A sketch of this frame-sampling rule appears after the table.)
Hardware Specification | Yes | All experiments are implemented by PyTorch and trained on NVIDIA RTX 3090.
Software Dependencies | No | The paper mentions 'PyTorch' and 'MMPose' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | For the tokenizer, the vocabulary size of hand codebook M1 and body codebook M2 are 1000 and 500, respectively. The dimension of each codeword in two codebooks is 512. The weighting factors β1, β2 and β3 in equation 3 are set to 0.1, 1.0 and 0.9, respectively. During pre-training, the Transformer encoder contains 8 heads with the input size of the Transformer encoder D as 1536 and position-wise feed-forward dimension as 2048. Training Setup. The Adam (Kingma and Ba 2014) optimizer is employed in our experiments. For tokenizer training, we set the initial learning rate as 0.001 and decrease it with a factor of 0.1 per 10 epochs. For pre-training, the weight decay and momentum are set to 0.01 and 0.9, respectively. The learning rate is set to 0.0001, with a warmup of 6 epochs, and linear learning rate decay. For the downstream SLR task, the learning rate is initialized to 0.0001 and decreases by a factor of 0.1 per 10 epochs. (A sketch of this training configuration appears after the table.)
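
Below is a minimal sketch of the 32-frame temporal sampling rule quoted in the Dataset Splits row (random window during training, centered window during testing). Since no code is released, the function name sample_frame_indices, the contiguous-window interpretation, and the padding rule for clips shorter than 32 frames are assumptions for illustration only.

```python
import numpy as np

def sample_frame_indices(num_frames: int, clip_len: int = 32, training: bool = True) -> np.ndarray:
    """Select `clip_len` frame indices from a clip of `num_frames` frames.

    Training -> randomly placed contiguous window (random temporal sampling).
    Testing  -> centered contiguous window (center temporal sampling).
    """
    if num_frames >= clip_len:
        if training:
            start = np.random.randint(0, num_frames - clip_len + 1)
        else:
            start = (num_frames - clip_len) // 2
        return np.arange(start, start + clip_len)
    # Clips shorter than 32 frames: repeat the last frame (assumed; the paper
    # does not state how short clips are handled).
    pad = np.full(clip_len - num_frames, num_frames - 1)
    return np.concatenate([np.arange(num_frames), pad])

# Example: a 120-frame video yields 32 indices in both modes.
train_idx = sample_frame_indices(120, training=True)    # random 32-frame window
test_idx = sample_frame_indices(120, training=False)    # centered 32-frame window
```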
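
The Experiment Setup row lists hyperparameters but no code. The PyTorch sketch below wires the quoted values (codebook sizes, encoder width, Adam learning rates, step decay, warmup plus linear decay) into standard optimizers and schedulers. It is an illustration under assumptions, not the authors' implementation: encoder stands in for the actual model, total_epochs is an assumed value, StepLR and LambdaLR are common stand-ins for the described schedules, and the stated momentum of 0.9 is interpreted as Adam's beta1.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR, LambdaLR

# Quoted tokenizer configuration: hand/body codebook sizes and codeword dimension.
HAND_CODEBOOK_SIZE, BODY_CODEBOOK_SIZE, CODE_DIM = 1000, 500, 512

# Quoted encoder configuration: 8 heads, input size D = 1536, feed-forward dim 2048.
# A single TransformerEncoderLayer stands in for the full model here.
encoder = torch.nn.TransformerEncoderLayer(d_model=1536, nhead=8, dim_feedforward=2048)

# Tokenizer training: Adam, lr 1e-3, decayed by 0.1 every 10 epochs.
tokenizer_opt = Adam(encoder.parameters(), lr=1e-3)
tokenizer_sched = StepLR(tokenizer_opt, step_size=10, gamma=0.1)

# Pre-training: Adam, lr 1e-4, weight decay 0.01, momentum 0.9 (read as beta1),
# 6-epoch warmup followed by linear decay. `total_epochs` is an assumed value.
total_epochs, warmup_epochs = 100, 6
pretrain_opt = Adam(encoder.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01)

def warmup_then_linear(epoch: int) -> float:
    """Scale factor on the base lr: linear warmup, then linear decay toward zero."""
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs
    return max(0.0, (total_epochs - epoch) / (total_epochs - warmup_epochs))

pretrain_sched = LambdaLR(pretrain_opt, lr_lambda=warmup_then_linear)

# Downstream SLR fine-tuning: Adam, lr 1e-4, decayed by 0.1 every 10 epochs.
finetune_opt = Adam(encoder.parameters(), lr=1e-4)
finetune_sched = StepLR(finetune_opt, step_size=10, gamma=0.1)

# In a training loop, each scheduler would be stepped once per epoch, e.g.:
#   for epoch in range(total_epochs):
#       train_one_epoch(...)
#       pretrain_sched.step()
```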