Conditional Variational Autoencoder for Sign Language Translation with Cross-Modal Alignment

Authors: Rui Zhao, Liang Zhang, Biao Fu, Cong Hu, Jinsong Su, Yidong Chen

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments conducted on public datasets demonstrate the effectiveness of our framework, achieving new state-of-the-art results while significantly alleviating the cross-modal representation discrepancy. The code and models are available at https://github.com/rzhao-zhsq/CV-SLT.
Researcher Affiliation | Academia | ¹School of Informatics, Xiamen University, China; ²Key Laboratory of Digital Protection and Intelligent Processing of Intangible Cultural Heritage of Fujian and Taiwan (Xiamen University), Ministry of Culture and Tourism, China. {zhsqzr, lzhang}@stu.xmu.edu.cn, ydchen@xmu.edu.cn
Pseudocode | No | The paper includes figures (Figure 2) illustrating the model architecture and mathematical equations, but it does not contain any structured pseudocode or algorithm blocks. (A generic conditional-VAE sketch is included below the table.)
Open Source Code | Yes | The code and models are available at https://github.com/rzhao-zhsq/CV-SLT.
Open Datasets | Yes | To evaluate the effectiveness of our proposed CV-SLT, we conduct extensive experiments on the following publicly available datasets: PHOENIX14T (Camgoz et al. 2018): PHOENIX14T is the most widely used benchmark for SLT in recent years. ... CSL-Daily (Zhou et al. 2021): CSL-Daily focuses on daily topics in Chinese sign language...
Dataset Splits | Yes | PHOENIX14T ... split into Train/Dev/Test sets of sizes 7,096/519/642, respectively. ... CSL-Daily ... split into Train/Dev/Test sets of sizes 18,401/1,077/1,176, respectively.
Hardware Specification | Yes | All experiments are conducted on a single NVIDIA TITAN RTX GPU.
Software Dependencies | No | We implement our CV-SLT based on the open-source SLRT. All experiments are conducted on a single NVIDIA TITAN RTX GPU. ... Following MMTLB (Chen et al. 2022a), we use the same configuration for the visual embedding and pretrained mBART encoder-decoder.
Experiment Setup | Yes | We adopt a learning rate of 1e-5 and select 64 for the dimension (d_z) of the latent variables. The self-distillation weight (λ) is set to 3 according to preliminary experiments. The KL annealing trick (Bowman et al. 2016) is used to avoid KL vanishing during training for the first 4K steps. During inference, we follow previous studies (Chen et al. 2022a,b) to use beam search with a length penalty of 1 and a beam size of 5. The batch size is set to 16 and AMP (Baboulin et al. 2009) is applied due to the computation limitation. (See the configuration sketch directly below the table.)
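
The quoted setup reduces to a handful of scalar hyperparameters. The sketch below (plain Python, independent of the CV-SLT codebase) shows one way the reported values could be wired together; every name in it (CONFIG, kl_weight, lambda_sd, total_loss) is an illustrative assumption, and only the numeric values come from the paper. Linear annealing is shown because the excerpt does not specify the schedule shape.

```python
# Minimal sketch of the reported training schedule -- NOT the authors' code.
# Key names are assumptions; values (1e-5, 64, 3, 4K, 16, 5, 1.0) are from the paper.

CONFIG = {
    "learning_rate": 1e-5,        # reported learning rate
    "latent_dim": 64,             # d_z, dimension of the latent variables
    "lambda_sd": 3.0,             # self-distillation weight λ
    "kl_annealing_steps": 4_000,  # KL annealing over the first 4K steps
    "batch_size": 16,
    "beam_size": 5,               # inference: beam search
    "length_penalty": 1.0,
}

def kl_weight(step: int, total: int = CONFIG["kl_annealing_steps"]) -> float:
    """KL annealing (Bowman et al. 2016): ramp the KL term's weight from
    0 to 1 over the first `total` steps to avoid KL vanishing."""
    return min(1.0, step / total)

def total_loss(ce_loss: float, kl_loss: float, sd_loss: float, step: int) -> float:
    """Weighted-sum illustration of where λ and the annealed KL weight enter;
    the exact loss composition in CV-SLT may differ."""
    return ce_loss + kl_weight(step) * kl_loss + CONFIG["lambda_sd"] * sd_loss
```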
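
Since the paper gives equations and figures but no algorithm block (see the Pseudocode row above), here is a textbook conditional-VAE training step for reference: a prior conditioned on the visual stream, a posterior conditioned on visual plus textual features, reparameterized sampling, and a closed-form Gaussian KL. All shapes and module names are placeholders; this is not a reconstruction of the CV-SLT architecture or its cross-modal alignment terms.

```python
import torch
import torch.nn as nn

class ToyCVAE(nn.Module):
    """Generic conditional VAE for conditional generation -- placeholder modules only."""

    def __init__(self, d_model: int = 512, d_z: int = 64, vocab: int = 1000):
        super().__init__()
        self.prior_net = nn.Linear(d_model, 2 * d_z)          # p(z | video)
        self.posterior_net = nn.Linear(2 * d_model, 2 * d_z)  # q(z | video, text)
        self.decoder = nn.Linear(d_model + d_z, vocab)        # p(text | video, z)

    def forward(self, h_video: torch.Tensor, h_text: torch.Tensor):
        # Prior and posterior Gaussian parameters.
        mu_p, logvar_p = self.prior_net(h_video).chunk(2, dim=-1)
        mu_q, logvar_q = self.posterior_net(
            torch.cat([h_video, h_text], dim=-1)).chunk(2, dim=-1)
        # Reparameterization trick: sample z from the posterior.
        z = mu_q + torch.randn_like(mu_q) * (0.5 * logvar_q).exp()
        logits = self.decoder(torch.cat([h_video, z], dim=-1))
        # Closed-form KL(q || p) between diagonal Gaussians, summed over d_z.
        kl = 0.5 * (logvar_p - logvar_q
                    + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
                    - 1).sum(-1)
        return logits, kl.mean()

# usage: logits, kl = ToyCVAE()(torch.randn(4, 512), torch.randn(4, 512))
```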