G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model

Authors: Pan Xie, Qipeng Zhang, Peng Taiying, Hao Tang, Yao Du, Zexian Li

AAAI 2024

Reproducibility Variable Result LLM Response
Research Type Experimental Our results show that our model outperforms state-of-the-art G2P models on the public SLP evaluation benchmark. For more generated results, please visit our project page: https://slpdiffusier.github.io/g2p-ddm. We evaluate our G2P model on the RWTH-PHOENIX-WEATHER-2014T dataset (Camgöz et al. 2018).
Researcher Affiliation Academia Pan Xie1, Qipeng Zhang1, Peng Taiying1, Hao Tang2*, Yao Du1, Zexian Li1 1Beihang University 2Carnegie Mellon University
Pseudocode No The paper describes its methods but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code No For more generated results, please visit our project page: https://slpdiffusier.github.io/g2p-ddm. (Explanation: The paper links to a project page for “generated results” but does not explicitly state that the source code for their methodology is provided there.)
Open Datasets Yes We evaluate our G2P model on the RWTH-PHOENIX-WEATHER-2014T dataset (Camgöz et al. 2018).
Dataset Splits Yes This corpus contains 7,096 training samples (with 1,066 different sign glosses in gloss annotations and 2,887 words in German spoken language translations), 519 validation samples, and 642 test samples.
Hardware Specification Yes We train the model on 8 NVIDIA Tesla V100 GPUs.
Software Dependencies No The paper mentions using OpenPose (Cao et al. 2021) but does not specify version numbers for any software dependencies.
Experiment Setup Yes The Pose-VQVAE consists of an Encoder, a Tokenizer, and a Decoder. The Encoder contains a linear layer that projects the pose points to a 256-dimensional hidden feature, followed by a 3-layer Transformer module with divided space-time attention (Bertasius, Wang, and Torresani 2021). The Tokenizer maintains a codebook of size 2,048. The Decoder contains the same 3-layer Transformer module as the Encoder and an SPL layer to predict the structural sign skeleton. For the discrete diffusion model, the timestep T is set to 100. All Transformer blocks of CodeUnet have d_model = 512 and N_depth = 2. The size of the local region l in Eq. (7) is set to 16, which is the average length of a gloss, and the number of nearest neighbors k is set to 16.
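
For convenience, the experiment-setup values quoted above can be collected into a single configuration sketch. The snippet below is a minimal, hypothetical Python layout (class and field names are illustrative, not the authors' released code); the numeric values are those reported in the paper and in the dataset-splits row above.

from dataclasses import dataclass

@dataclass
class PoseVQVAEConfig:
    # Encoder: linear projection of pose points to a 256-d hidden feature,
    # then a 3-layer Transformer with divided space-time attention.
    hidden_dim: int = 256
    encoder_layers: int = 3
    # Tokenizer: codebook of 2,048 discrete pose codes.
    codebook_size: int = 2048
    # Decoder: mirrors the 3-layer encoder, followed by an SPL layer
    # that predicts the structural sign skeleton.
    decoder_layers: int = 3

@dataclass
class DiscreteDiffusionConfig:
    timesteps: int = 100       # diffusion timestep T
    d_model: int = 512         # Transformer blocks of CodeUnet
    n_depth: int = 2
    local_region_l: int = 16   # local region size l in Eq. (7), ~average gloss length
    num_neighbors_k: int = 16  # number of nearest neighbors k

@dataclass
class PhoenixSplits:
    # RWTH-PHOENIX-WEATHER-2014T splits reported in the paper.
    train: int = 7096
    dev: int = 519
    test: int = 642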