Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition

Authors: Hao Zhou, Wengang Zhou, Yun Zhou, Houqiang Li

AAAI 2020, pp. 13009-13016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks: PHOENIX-2014, CSL and PHOENIX-2014-T. Experimental results demonstrate that the proposed method achieves new state-of-the-art performance on all three benchmarks."
Researcher Affiliation | Academia | "Hao Zhou, Wengang Zhou, Yun Zhou, Houqiang Li, CAS Key Laboratory of GIPAS, University of Science and Technology of China, zhouh156@mail.ustc.edu.cn, {zhwg, zhouyun, lihq}@ustc.edu.cn"
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. There are no links or explicit statements about code release.
Open Datasets | Yes | "We evaluate our method on three datasets, including PHOENIX-2014 (Koller, Forster, and Ney 2015), CSL (Huang et al. 2018; Guo et al. 2018) and PHOENIX-2014-T (Cihan Camgoz et al. 2018). ... To obtain the keypoint positions for training, we use the publicly available HRNet (Sun et al. 2019) toolbox..." (a hedged keypoint-extraction sketch follows the table)
Dataset Splits | Yes | "The split of videos for Train, Dev and Test is 5672, 540 and 629, respectively. Our method is evaluated on the multi-signer database. ... The split of videos for Train, Dev and Test is 7096, 519 and 642, respectively." (the first split refers to PHOENIX-2014, the second to PHOENIX-2014-T)
Hardware Specification | Yes | "Experiments are run on 4 GTX 1080Ti GPUs."
Software Dependencies | No | The paper mentions 'Our network architecture is implemented in PyTorch' but does not provide specific version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | "In our experiments, the input frames are resized to 224×224. For data augmentation in one video, we add random crop at the same location of all frames, random discard of 20% frames and random flip of all frames. For inter-cue features, the number of output channels after TCOVs and BLSTM are all set to 1024. There are 4 visual cues. For each intra-cue feature, the number of output channels after TCOVs and BLSTM are all set to 256. ... First, we train a VGG11-based network as DNF (Cui, Liu, and Zhang 2019) and use it to decode pseudo labels for each clip. Then, we add a fully-connected layer after each output of the TMC module. The STMC network without BLSTM is trained with cross-entropy and smooth-L1 loss by SGD optimizer. The batch size is 24 and the clip size is 16. Finally, with finetuned parameters from the previous stage, our full STMC network is trained end-to-end under joint loss optimization. We use Adam optimizer with learning rate 5 × 10^-5 and set the batch size to 2. In all experiments, we set α to 0.6 and β to 30. ... For finetuning, we train the STMC network without BLSTM for 25 epochs. Afterward, the whole STMC network is trained end-to-end for 30 epochs. For inference, the beam width is set to 20." (hedged sketches of the quoted augmentation and training setup follow the table)
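As a companion to the Open Datasets row: the pose cue is supervised with keypoints produced by an off-the-shelf estimator (the HRNet toolbox in the paper). The sketch below shows per-frame keypoint extraction only at that level of detail; it substitutes torchvision's Keypoint R-CNN for HRNet, since the toolbox's exact interface is not given in the excerpt, and the frame directory and output file are hypothetical.

```python
# Sketch: per-frame 2D keypoint extraction with an off-the-shelf pose estimator.
# The paper uses the HRNet toolbox; torchvision's Keypoint R-CNN serves here only
# as a readily available stand-in. Frame directory and output path are hypothetical.
from pathlib import Path

import torch
from torchvision.io import ImageReadMode, read_image
from torchvision.models.detection import keypointrcnn_resnet50_fpn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = keypointrcnn_resnet50_fpn(weights="DEFAULT").eval().to(device)

frame_dir = Path("frames/train/sample_video")  # hypothetical frame directory

keypoints = {}
with torch.no_grad():
    for frame_path in sorted(frame_dir.glob("*.png")):
        img = read_image(str(frame_path), ImageReadMode.RGB).float() / 255.0  # (3, H, W) in [0, 1]
        out = model([img.to(device)])[0]
        if len(out["keypoints"]) > 0:
            # Keep the highest-scoring detection; shape (17, 3) = (x, y, visibility).
            keypoints[frame_path.name] = out["keypoints"][0].cpu()

torch.save(keypoints, "sample_video_keypoints.pt")  # hypothetical output file
```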
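The Experiment Setup row describes a video-level augmentation: one random crop applied at the same location to every frame, random discard of 20% of the frames, and a random flip of the whole video. A minimal sketch of such a transform follows; the (T, C, H, W) tensor layout, the choice of a horizontal flip, and the helper name are assumptions, while the 224 crop size and 20% discard ratio follow the quote.

```python
# Sketch of the video-level augmentation quoted in the Experiment Setup row:
# one random crop location shared by all frames, random discard of 20% of the
# frames, and a random flip applied to the whole video (horizontal flip assumed).
import random

import torch

def augment_video(frames: torch.Tensor, crop_size: int = 224,
                  discard_ratio: float = 0.2, flip_prob: float = 0.5) -> torch.Tensor:
    """frames: (T, C, H, W) float tensor, already resized so that H, W >= crop_size."""
    t, _, h, w = frames.shape

    # Random crop at the same location for all frames.
    top = random.randint(0, h - crop_size)
    left = random.randint(0, w - crop_size)
    frames = frames[:, :, top:top + crop_size, left:left + crop_size]

    # Randomly discard 20% of the frames while preserving temporal order.
    keep = max(1, int(round(t * (1.0 - discard_ratio))))
    idx = sorted(random.sample(range(t), keep))
    frames = frames[idx]

    # Random flip applied to all frames together.
    if random.random() < flip_prob:
        frames = torch.flip(frames, dims=[-1])

    return frames

# Example: a 100-frame video of 256x256 RGB frames -> roughly (80, 3, 224, 224).
video = torch.rand(100, 3, 256, 256)
clip = augment_video(video)
```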
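The quoted training schedule is two-stage: the STMC network without BLSTM is first finetuned with SGD on 16-frame clips (batch size 24, cross-entropy plus smooth-L1, 25 epochs), and the full network is then trained end-to-end for 30 epochs with Adam at learning rate 5 × 10^-5, batch size 2, and loss weights α = 0.6 and β = 30. The sketch below illustrates only the stage-2 joint optimization with those quoted hyperparameters; the tiny stand-in model, the vocabulary and keypoint sizes, the toy batch, and the exact way α and β combine the CTC and regression terms are all assumptions, not the paper's architecture or loss.

```python
# Minimal runnable sketch of the stage-2 joint optimization quoted above:
# Adam with learning rate 5e-5, a CTC term plus auxiliary terms weighted by
# alpha = 0.6 and beta = 30. The tiny model and the way the terms are combined
# are illustrative assumptions, not the paper's STMC network or joint loss.
import torch
import torch.nn as nn

class TinySTMC(nn.Module):
    """Stand-in for the STMC network: one temporal conv, a gloss classifier
    and a pose-regression head."""
    def __init__(self, in_dim=512, feat=256, vocab=100, n_kpts=7):
        super().__init__()
        self.tconv = nn.Conv1d(in_dim, feat, kernel_size=5, padding=2)
        self.gloss = nn.Linear(feat, vocab + 1)   # +1 for the CTC blank
        self.pose = nn.Linear(feat, 2 * n_kpts)

    def forward(self, x):                         # x: (B, in_dim, T)
        h = self.tconv(x).transpose(1, 2)         # (B, T, feat)
        return self.gloss(h), self.pose(h)

alpha, beta = 0.6, 30.0
model = TinySTMC()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)   # learning rate as quoted
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
smooth_l1 = nn.SmoothL1Loss()

# One toy batch (batch size 2, as quoted); real inputs are frame features.
B, T, U = 2, 40, 8
feats = torch.randn(B, 512, T)
targets = torch.randint(1, 101, (B, U))
in_lens = torch.full((B,), T, dtype=torch.long)
tgt_lens = torch.full((B,), U, dtype=torch.long)
gt_kpts = torch.randn(B, T, 14)

gloss_logits, pred_kpts = model(feats)
log_probs = gloss_logits.log_softmax(-1).transpose(0, 1)    # (T, B, vocab+1) for CTCLoss
loss = (ctc(log_probs, targets, in_lens, tgt_lens)
        + alpha * ctc(log_probs, targets, in_lens, tgt_lens)   # placeholder for the intra-cue CTC terms
        + beta * smooth_l1(pred_kpts, gt_kpts))

optimizer.zero_grad()
loss.backward()
optimizer.step()
```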