Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Advanced Sign Language Video Generation with Compressed and Quantized Multi-Condition Tokenization

Authors: Cong Wang, Zexuan Deng, Zhiwei Jiang, Yafeng Yin, Fei Shen, Zifeng Cheng, Shiping Ge, Shiwei Gan, Qing Gu

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that SignViP achieves state-of-the-art performance across metrics, including video quality, temporal coherence, and semantic fidelity. The code is available at https://github.com/umnooob/signvip/. Our experimental results demonstrate that SignViP achieves state-of-the-art performance across multiple evaluation metrics, including video quality, temporal coherence, and semantic fidelity.
Researcher Affiliation | Academia | Cong Wang¹, Zexuan Deng¹, Zhiwei Jiang¹, Yafeng Yin¹, Fei Shen², Zifeng Cheng¹, Shiping Ge¹, Shiwei Gan¹, Qing Gu¹. ¹ State Key Laboratory for Novel Software Technology, Nanjing University; ² National University of Singapore.
Pseudocode | No | The paper describes the methodology using textual descriptions and architectural diagrams (e.g., Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code is available at https://github.com/umnooob/signvip/.
Open Datasets | Yes | We employ two sign language datasets for experiments. (1) RWTH-2014T [5] is a German sign language dataset. It comprises 8,257 sign language videos. The dataset is divided into 7,096 training samples, 519 validation samples, and 642 test samples. To align with the 8× downsampling rate of VAE, the frame size was resized from 260 × 210 to 272 × 224. (2) How2Sign [11] is an American sign language dataset. It includes 2,456 sign language videos.
Dataset Splits | Yes | (1) RWTH-2014T [5] is a German sign language dataset. It comprises 8,257 sign language videos. The dataset is divided into 7,096 training samples, 519 validation samples, and 642 test samples. (2) How2Sign [11] is an American sign language dataset. It includes 2,456 sign language videos. Using the provided timestamps, we segmented the videos to create a sentence-level dataset. This dataset consists of 31,128 training samples, 2,322 test samples, and 1,741 validation samples.
Hardware Specification | Yes | The training of the three stages are conducted on 4 NVIDIA RTX A6000 GPUs using Adam optimizer [30], with each stage consisting of 50,000 training steps. The batch sizes of stage I, II, and III are 2, 16, and 16. Their learning rates are 1e-5, 5e-5, and 1e-6.
Software Dependencies | No | The paper mentions specific models like "CLIP text encoder [49]" and "GPT-2 model [48]", and "Stable Diffusion v1.5 [55]" for initialization, but does not provide specific version numbers for general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | Training Details. The training of the three stages are conducted on 4 NVIDIA RTX A6000 GPUs using Adam optimizer [30], with each stage consisting of 50,000 training steps. The batch sizes of stage I, II, and III are 2, 16, and 16. Their learning rates are 1e-5, 5e-5, and 1e-6. In Multi-Condition Token Translator, we utilize a multilingual version of the CLIP model to enable handling of multiple spoken language texts effectively. In FSQ Autoencoder, the encoder and decoder follow the architecture of their counterparts in VAE. Specifically, FSQ Autoencoder applies 4 latent channels, with each channel having a quantization level of 5. Together, this results in a total vocabulary size of 625, computed as 5^4 = 625 due to the combination of levels across all channels. In Sign Video Diffusion Model, both the Denoising U-Net and the Reference Net are initialized with Stable Diffusion v1.5. The temporal-attention layers in the Denoising U-Net are initialized from AnimateDiff [21]. The condition augmentation rate is set to 0.001. During inference, Sign Video Diffusion Model utilizes a guidance scale of 3.5 for CFG. Additionally, the number of inference steps is configured to 50.
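
The arithmetic quoted in the rows above can be checked with a short script. The following is a minimal sketch in Python, not code from the paper or its repository: the function names are hypothetical, the frame sizes are assumed to be height × width, and the mixed-radix token index is one common way to enumerate a finite scalar quantization (FSQ) codebook, which the paper may or may not use.

```python
from itertools import product

# Split sizes as quoted in the "Dataset Splits" row.
SPLITS = {
    "RWTH-2014T": {"train": 7096, "val": 519, "test": 642},
    "How2Sign (sentence-level)": {"train": 31128, "val": 1741, "test": 2322},
}


def split_total(name):
    """Sum the train/val/test sizes of one dataset."""
    return sum(SPLITS[name].values())


def latent_size(height, width, factor=8):
    """Spatial size after `factor`x downsampling.

    Raises ValueError when a side is not divisible by `factor`,
    which is the situation the resize in the table avoids.
    """
    if height % factor or width % factor:
        raise ValueError(f"{height}x{width} is not divisible by {factor}")
    return height // factor, width // factor


def fsq_vocab_size(channels=4, levels=5):
    """Number of distinct codes: `levels` choices in each channel."""
    return levels ** channels


def fsq_token_id(codes, levels=5):
    """Map per-channel level indices to one token id (mixed radix).

    Assumption: this indexing scheme is illustrative only.
    """
    index = 0
    for code in codes:
        if not 0 <= code < levels:
            raise ValueError(f"level index {code} out of range")
        index = index * levels + code
    return index


if __name__ == "__main__":
    for name in SPLITS:
        print(name, split_total(name))
    print("latent size:", latent_size(272, 224))
    print("FSQ vocabulary:", fsq_vocab_size())
    # Every combination of 4 channels x 5 levels gets its own id.
    ids = {fsq_token_id(c) for c in product(range(5), repeat=4)}
    print("distinct ids:", len(ids))
```

Under these assumptions, the RWTH-2014T splits sum to the reported 8,257 videos, the original 260 × 210 frames fail the divisibility check while 272 × 224 passes, and 4 channels with 5 levels each give the stated vocabulary of 625.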