G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model
Authors: Pan Xie, Qipeng Zhang, Peng Taiying, Hao Tang, Yao Du, Zexian Li
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that our model outperforms state-of-the-art G2P models on the public SLP evaluation benchmark. For more generated results, please visit our project page: https://slpdiffusier.github.io/g2p-ddm. We evaluate our G2P model on RWTH-PHOENIX-WEATHER-2014T dataset (Camgöz et al. 2018). |
| Researcher Affiliation | Academia | Pan Xie¹, Qipeng Zhang¹, Peng Taiying¹, Hao Tang²*, Yao Du¹, Zexian Li¹ — ¹Beihang University, ²Carnegie Mellon University |
| Pseudocode | No | The paper describes its methods but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | For more generated results, please visit our project page: https://slpdiffusier.github.io/g2p-ddm. (Explanation: The paper links to a project page for “generated results” but does not explicitly state that the source code for their methodology is provided there.) |
| Open Datasets | Yes | We evaluate our G2P model on RWTH-PHOENIX-WEATHER-2014T dataset (Camgöz et al. 2018). |
| Dataset Splits | Yes | This corpus contains 7,096 training samples (with 1,066 different sign glosses in gloss annotations and 2,887 words in German spoken language translations), 519 validation samples, and 642 test samples. |
| Hardware Specification | Yes | We train the model on 8 NVIDIA Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions using OpenPose (Cao et al. 2021) but does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | The Pose-VQVAE consists of an Encoder, a Tokenizer, and a Decoder. The Encoder contains a linear layer to transform pose points to a hidden feature with a dimension set as 256 and a 3-layer Transformer module with divided space-time attention (Bertasius, Wang, and Torresani 2021). The Tokenizer maintains a codebook with a size set as 2,048. The Decoder contains the same 3-layer Transformer module as the Encoder and an SPL layer to predict the structural sign skeleton. For the discrete diffusion model, we set the timestep T as 100. All Transformer blocks of CodeUnet have d_model=512 and N_depth=2. The size of the local region l in Eq. (7) is set as 16, which is the average length of a gloss, and the number of nearest neighbors k is set as 16. (See the configuration sketch after the table.) |
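
For quick reference, the hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is only an illustrative summary: the authors have not released code, so the class and field names below are assumptions in PyTorch-style naming, and only the numeric values come from the paper's reported setup.

```python
# Hypothetical configuration sketch for the reported G2P-DDM setup.
# Class and field names are illustrative; only the numeric values are taken
# from the paper's "Experiment Setup" description.
from dataclasses import dataclass


@dataclass
class PoseVQVAEConfig:
    hidden_dim: int = 256      # linear projection of pose points to a hidden feature
    num_layers: int = 3        # Transformer layers with divided space-time attention
    codebook_size: int = 2048  # entries in the Tokenizer's codebook
    # The Decoder mirrors the Encoder (3 Transformer layers) and ends with an
    # SPL layer that predicts the structural sign skeleton.


@dataclass
class DiscreteDiffusionConfig:
    timesteps: int = 100    # diffusion timestep T
    d_model: int = 512      # width of all CodeUnet Transformer blocks
    n_depth: int = 2        # depth of each CodeUnet Transformer block
    local_region: int = 16  # l in Eq. (7), the average length of a gloss
    num_neighbors: int = 16  # k nearest neighbors


if __name__ == "__main__":
    # Print the collected hyperparameters for inspection.
    print(PoseVQVAEConfig())
    print(DiscreteDiffusionConfig())
```

The split into two dataclasses simply mirrors the paper's separation of the Pose-VQVAE stage from the discrete diffusion stage; training-time details (8 NVIDIA Tesla V100 GPUs) are listed in the Hardware Specification row above.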