Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Multi-modal and Multi-scale Spatial Environment Understanding for Immersive Visual Text-to-Speech

Authors: Rui Liu, Shuwei He, Yifan Hu, Haizhou Li

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental The paper states: "Objective and subjective evaluations suggest that our model outperforms the advanced baselines in environmental speech generation." It also includes sections such as "Experiments and Results", "Dataset", "Implementation Details", "Evaluation Metrics", "Baselines", "Main Results", and "Ablation Results", which are characteristic of experimental research.
Researcher Affiliation Academia The affiliations listed are "Inner Mongolia University, China", "Shenzhen Research Institute of Big Data, School of Data Science, The Chinese University of Hong Kong, Shenzhen, China", and "Department of Electrical and Computer Engineering, National University of Singapore, Singapore". All these institutions are academic.
Pseudocode No The paper describes the methodology in narrative text and uses a diagram (Figure 1), but it does not contain any explicit pseudocode or algorithm blocks.
Open Source Code Yes The paper provides a direct link to a GitHub repository: "Code and Audio Samples https://github.com/AI-S2-Lab/M2SE-VTTS".
Open Datasets Yes The paper states: "We employ the Sound Spaces-Speech dataset (Chen et al. 2023), which is developed on the Sound Spaces platform using real-world 3D scans to simulate environmental audio."
Dataset Splits Yes The paper specifies exact counts for dataset splits: "The dataset consists of 28,853 training samples, 1,441 validation samples, and 1,489 testing samples."
Hardware Specification Yes The paper explicitly states the hardware used for training: "the M2SE-VTTS model is trained on a single NVIDIA A800 GPU"
Software Dependencies No The paper mentions tools like an "open-source grapheme-to-phoneme tool" and "Parselmouth" (with a footnote link), and models like "CLIP-ViT-L/14" and "BigVGAN", but does not provide specific version numbers for any software libraries or dependencies used for replication.
Experiment Setup Yes The paper details several hyperparameters and training configurations, including: "The phoneme vocabulary consists of 74 distinct phonemes. The cross-modal fusion module employs two attention heads, while all other attention mechanisms use four heads each. The patch number, Top-k, is set to 140. ... In the denoiser module, we use five transformer layers with a hidden size of 384 and 12 heads. Each transformer block functions as the identity, with T set to 100 and β values increasing linearly from β1 = 10⁻⁴ to βT = 0.06. ... training the encoder for 120k steps until convergence. In the main training stage, the M2SE-VTTS model is trained ... extending over 160k steps until convergence. ... with a batch size of 48 sentences"
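The diffusion noise schedule quoted in the setup above (T = 100 denoising steps, β increasing linearly from 10⁻⁴ to 0.06) can be sketched as follows. This is an illustrative reconstruction using NumPy, not code from the paper; the function name `linear_beta_schedule` and the cumulative-product derivation are assumptions based on standard diffusion-model practice.

```python
import numpy as np

def linear_beta_schedule(T: int = 100,
                         beta_1: float = 1e-4,
                         beta_T: float = 0.06) -> np.ndarray:
    """Linearly spaced noise-variance schedule: T values from beta_1 to beta_T,
    matching the hyperparameters quoted from the paper."""
    return np.linspace(beta_1, beta_T, T)

betas = linear_beta_schedule()
# Cumulative signal retention (alpha-bar) after each step, as used in
# standard DDPM-style forward processes.
alphas_cumprod = np.cumprod(1.0 - betas)
```

With these values, the cumulative product decays monotonically, so almost all signal is replaced by noise by step T, which is the usual intent of a linear schedule at this scale.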