Spatial-Temporal Multi-Cue Network for Continuous Sign Language Recognition

Authors: Hao Zhou, Wengang Zhou, Yun Zhou, Houqiang Li

AAAI 2020, pp. 13009-13016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To validate the effectiveness, we perform experiments on three large-scale CSLR benchmarks: PHOENIX-2014, CSL and PHOENIX-2014-T. Experimental results demonstrate that the proposed method achieves new state-of-the-art performance on all three benchmarks."
Researcher Affiliation | Academia | "Hao Zhou, Wengang Zhou, Yun Zhou, Houqiang Li, CAS Key Laboratory of GIPAS, University of Science and Technology of China, zhouh156@mail.ustc.edu.cn, {zhwg, zhouyun, lihq}@ustc.edu.cn"
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. There are no links or explicit statements about code release.
Open Datasets | Yes | "We evaluate our method on three datasets, including PHOENIX-2014 (Koller, Forster, and Ney 2015), CSL (Huang et al. 2018; Guo et al. 2018) and PHOENIX-2014-T (Cihan Camgoz et al. 2018). ... To obtain the keypoint positions for training, we use the publicly available HRNet (Sun et al. 2019) toolbox..." (a hedged keypoint-extraction sketch follows the table)
Dataset Splits | Yes | "The split of videos for Train, Dev and Test is 5672, 540 and 629, respectively. Our method is evaluated on the multi-signer database. ... The split of videos for Train, Dev and Test is 7096, 519 and 642, respectively." (the first split refers to PHOENIX-2014, the second to PHOENIX-2014-T)
Hardware Specification | Yes | "Experiments are run on 4 GTX 1080Ti GPUs."
Software Dependencies | No | The paper mentions 'Our network architecture is implemented in PyTorch' but does not provide specific version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | "In our experiments, the input frames are resized to 224×224. For data augmentation in one video, we add random crop at the same location of all frames, random discard of 20% frames and random flip of all frames. For inter-cue features, the number of output channels after TCOVs and BLSTM are all set to 1024. There are 4 visual cues. For each intra-cue feature, the number of output channels after TCOVs and BLSTM are all set to 256. ... First, we train a VGG11-based network as DNF (Cui, Liu, and Zhang 2019) and use it to decode pseudo labels for each clip. Then, we add a fully-connected layer after each output of the TMC module. The STMC network without BLSTM is trained with cross-entropy and smooth-L1 loss by SGD optimizer. The batch size is 24 and the clip size is 16. Finally, with finetuned parameters from the previous stage, our full STMC network is trained end-to-end under joint loss optimization. We use Adam optimizer with learning rate 5 × 10^-5 and set the batch size to 2. In all experiments, we set α to 0.6 and β to 30. ... For finetuning, we train the STMC network without BLSTM for 25 epochs. Afterward, the whole STMC network is trained end-to-end for 30 epochs. For inference, the beam width is set to 20." (hedged sketches of the quoted augmentation and training setup follow the table)
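As a companion to the Open Datasets row: the pose cue is supervised with keypoints produced by an off-the-shelf estimator (the HRNet toolbox in the paper). The sketch below shows per-frame keypoint extraction only at that level of detail; it substitutes torchvision's Keypoint R-CNN for HRNet, since the toolbox's exact interface is not given in the excerpt, and the frame directory and output file are hypothetical.

```python
# Sketch: per-frame 2D keypoint extraction with an off-the-shelf pose estimator.
# The paper uses the HRNet toolbox; torchvision's Keypoint R-CNN serves here only
# as a readily available stand-in. Frame directory and output path are hypothetical.
from pathlib import Path

import torch
from torchvision.io import ImageReadMode, read_image
from torchvision.models.detection import keypointrcnn_resnet50_fpn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = keypointrcnn_resnet50_fpn(weights="DEFAULT").eval().to(device)

frame_dir = Path("frames/train/sample_video")  # hypothetical frame directory

keypoints = {}
with torch.no_grad():
    for frame_path in sorted(frame_dir.glob("*.png")):
        img = read_image(str(frame_path), ImageReadMode.RGB).float() / 255.0  # (3, H, W) in [0, 1]
        out = model([img.to(device)])[0]
        if len(out["keypoints"]) > 0:
            # Keep the highest-scoring detection; shape (17, 3) = (x, y, visibility).
            keypoints[frame_path.name] = out["keypoints"][0].cpu()

torch.save(keypoints, "sample_video_keypoints.pt")  # hypothetical output file
```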
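The Experiment Setup row describes a video-level augmentation: one random crop applied at the same location to every frame, random discard of 20% of the frames, and a random flip of the whole video. A minimal sketch of such a transform follows; the (T, C, H, W) tensor layout, the choice of a horizontal flip, and the helper name are assumptions, while the 224 crop size and 20% discard ratio follow the quote.

```python
# Sketch of the video-level augmentation quoted in the Experiment Setup row:
# one random crop location shared by all frames, random discard of 20% of the
# frames, and a random flip applied to the whole video (horizontal flip assumed).
import random

import torch

def augment_video(frames: torch.Tensor, crop_size: int = 224,
                  discard_ratio: float = 0.2, flip_prob: float = 0.5) -> torch.Tensor:
    """frames: (T, C, H, W) float tensor, already resized so that H, W >= crop_size."""
    t, _, h, w = frames.shape

    # Random crop at the same location for all frames.
    top = random.randint(0, h - crop_size)
    left = random.randint(0, w - crop_size)
    frames = frames[:, :, top:top + crop_size, left:left + crop_size]

    # Randomly discard 20% of the frames while preserving temporal order.
    keep = max(1, int(round(t * (1.0 - discard_ratio))))
    idx = sorted(random.sample(range(t), keep))
    frames = frames[idx]

    # Random flip applied to all frames together.
    if random.random() < flip_prob:
        frames = torch.flip(frames, dims=[-1])

    return frames

# Example: a 100-frame video of 256x256 RGB frames -> roughly (80, 3, 224, 224).
video = torch.rand(100, 3, 256, 256)
clip = augment_video(video)
```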
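The quoted training schedule is two-stage: the STMC network without BLSTM is first finetuned with SGD on 16-frame clips (batch size 24, cross-entropy plus smooth-L1, 25 epochs), and the full network is then trained end-to-end for 30 epochs with Adam at learning rate 5 × 10^-5, batch size 2, and loss weights α = 0.6 and β = 30. The sketch below illustrates only the stage-2 joint optimization with those quoted hyperparameters; the tiny stand-in model, the vocabulary and keypoint sizes, the toy batch, and the exact way α and β combine the CTC and regression terms are all assumptions, not the paper's architecture or loss.

```python
# Minimal runnable sketch of the stage-2 joint optimization quoted above:
# Adam with learning rate 5e-5, a CTC term plus auxiliary terms weighted by
# alpha = 0.6 and beta = 30. The tiny model and the way the terms are combined
# are illustrative assumptions, not the paper's STMC network or joint loss.
import torch
import torch.nn as nn

class TinySTMC(nn.Module):
    """Stand-in for the STMC network: one temporal conv, a gloss classifier
    and a pose-regression head."""
    def __init__(self, in_dim=512, feat=256, vocab=100, n_kpts=7):
        super().__init__()
        self.tconv = nn.Conv1d(in_dim, feat, kernel_size=5, padding=2)
        self.gloss = nn.Linear(feat, vocab + 1)   # +1 for the CTC blank
        self.pose = nn.Linear(feat, 2 * n_kpts)

    def forward(self, x):                         # x: (B, in_dim, T)
        h = self.tconv(x).transpose(1, 2)         # (B, T, feat)
        return self.gloss(h), self.pose(h)

alpha, beta = 0.6, 30.0
model = TinySTMC()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)   # learning rate as quoted
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
smooth_l1 = nn.SmoothL1Loss()

# One toy batch (batch size 2, as quoted); real inputs are frame features.
B, T, U = 2, 40, 8
feats = torch.randn(B, 512, T)
targets = torch.randint(1, 101, (B, U))
in_lens = torch.full((B,), T, dtype=torch.long)
tgt_lens = torch.full((B,), U, dtype=torch.long)
gt_kpts = torch.randn(B, T, 14)

gloss_logits, pred_kpts = model(feats)
log_probs = gloss_logits.log_softmax(-1).transpose(0, 1)    # (T, B, vocab+1) for CTCLoss
loss = (ctc(log_probs, targets, in_lens, tgt_lens)
        + alpha * ctc(log_probs, targets, in_lens, tgt_lens)   # placeholder for the intra-cue CTC terms
        + beta * smooth_l1(pred_kpts, gt_kpts))

optimizer.zero_grad()
loss.backward()
optimizer.step()
```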