Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Unsupervised Mismatch Localization in Cross-Modal Sequential Data with Application to Mispronunciations Localization

Authors: Wei Wei, Hengguan Huang, Xiangming Gu, Hao Wang, Ye Wang

TMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate our proposed ML-VAE and ML-VAE-RL on the mispronunciation localization task to test their mismatch localization ability. We first conduct experiments on a synthetic dataset (Mismatch-Audio MNIST) and then further apply our proposed models to a real-world speech-text dataset (L2-ARCTIC). Table 1 shows the mispronunciation localization results of ML-VAE and ML-VAE-RL on the synthetic dataset Mismatch-Audio MNIST. We also perform ablation studies to verify the effectiveness of our training algorithm.
Researcher Affiliation Academia Wei Wei, National University of Singapore; Hengguan Huang, National University of Singapore; Xiangming Gu, National University of Singapore; Hao Wang, Rutgers University; Ye Wang, National University of Singapore
Pseudocode Yes The overall algorithm to learn ML-VAE is shown in Algorithm 1.
Algorithm 1 Learning ML-VAE
Input: Speech feature sequence X, phoneme sequence C
Output: Mismatch localization result Ĉ
1: Initialize the model parameters ϕp, ϕb, and ϕh.
2: Obtain the forced alignment result B.
3: while not converged do
4:   Estimate B̂ and Π̂ with ML-FSA.
5:   Using Eq. 4, optimize ϕp with the phoneme sequence C.
6:   Using Eq. 3, with the help of B, optimize ϕb.
7:   Given Π̂, optimize ϕh using Eq. 6.
8: end while
9: Obtain the mismatch localization result Ĉ with ML-FSA.
10: return Ĉ
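The control flow of Algorithm 1 can be sketched as a small Python skeleton. Every helper below (forced_alignment, ml_fsa_estimate, the three optimize_* updates) is a placeholder stub standing in for the paper's forced aligner, ML-FSA inference, and the objectives in Eq. 3, 4, and 6; none of this is the authors' implementation, only the alternating-update structure.

```python
# Hypothetical skeleton of Algorithm 1 (Learning ML-VAE).
# All helpers are placeholder stubs illustrating the control flow only.

def forced_alignment(X, C):
    # Stub for step 2: pretend each phoneme occupies one unit-length segment.
    return [(i, i + 1) for i in range(len(C))]

def ml_fsa_estimate(X, C, phi_p, phi_b, phi_h):
    # Stub for ML-FSA inference: boundaries B_hat and mismatch indicators Pi_hat.
    B_hat = [(i, i + 1) for i in range(len(C))]
    Pi_hat = [0] * len(C)  # 0 = no mismatch at this phoneme position
    return B_hat, Pi_hat

def optimize_phoneme(phi_p, C):      # stands in for Eq. 4 (step 5)
    return phi_p

def optimize_boundary(phi_b, B):     # stands in for Eq. 3, guided by B (step 6)
    return phi_b

def optimize_mismatch(phi_h, Pi):    # stands in for Eq. 6 (step 7)
    return phi_h

def learn_ml_vae(X, C, steps=3):
    phi_p, phi_b, phi_h = {}, {}, {}             # step 1: initialize parameters
    B = forced_alignment(X, C)                   # step 2: forced alignment B
    for _ in range(steps):                       # step 3: "while not converged"
        _, Pi_hat = ml_fsa_estimate(X, C, phi_p, phi_b, phi_h)  # step 4
        phi_p = optimize_phoneme(phi_p, C)       # step 5
        phi_b = optimize_boundary(phi_b, B)      # step 6
        phi_h = optimize_mismatch(phi_h, Pi_hat) # step 7
    _, C_hat = ml_fsa_estimate(X, C, phi_p, phi_b, phi_h)  # step 9: final decode
    return C_hat                                 # step 10
```

The key structural point the sketch captures is that the three parameter groups are updated in alternation within each iteration, with ML-FSA inference re-run at the top of every loop.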
Open Source Code No Codes will soon be available at https://github.com/weiwei-ww/ML-VAE
Open Datasets Yes We first conduct experiments on a synthetic dataset, named Mismatch-Audio MNIST, which is built on the Audio MNIST dataset (Becker et al., 2018). Our synthetic dataset contains 3000 audio samples, each produced by concatenating three to seven spoken digits randomly selected from the original Audio MNIST. ... We further apply ML-VAE and ML-VAE-RL to a real-world dataset: the L2-ARCTIC dataset (Zhao et al., 2018), which is a non-native English corpus containing 11026 utterances from 24 non-native speakers.
Dataset Splits Yes The dataset is split into training, validation, and test sets by a 60:20:20 ratio, and the durations of the three sets are 96.3, 33.2, and 30.6 minutes respectively.
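On 3000 synthetic samples, the reported 60:20:20 ratio implies 1800/600/600 examples. A minimal deterministic split in pure Python (illustrative only; the paper does not describe its exact splitting code, and whether shuffling or speaker-disjoint splitting was used is not stated):

```python
def split_indices(n, ratios=(0.6, 0.2, 0.2)):
    """Split n sample indices into train/val/test by the given ratios.
    Deterministic and order-preserving; shuffling is omitted for clarity."""
    n_train = int(n * ratios[0])
    n_val = int(n * ratios[1])
    idx = list(range(n))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]

train, val, test = split_indices(3000)  # 1800 / 600 / 600 samples
```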
Hardware Specification No No specific hardware details are mentioned for the experiments. The paper describes model architectures and training procedures but does not specify GPU/CPU models, processor types, or memory used for computation.
Software Dependencies No The paper mentions features like 40-dimension FBANK feature (Young et al., 2002) and references methods such as Gumbel Softmax function (Jang et al., 2016) and bidirectional SRU layers (Lei et al., 2018). However, it does not provide specific version numbers for any programming languages, libraries, or frameworks used (e.g., Python, PyTorch, TensorFlow, CUDA).
Experiment Setup Yes The boundary detector includes two LSTM layers, each with 512 nodes, followed by two fully connected (FC) layers with 128 nodes and ReLU activations. ... The weight of the KL term λb is set as 0.01. ... During training, λr and λl are set to 1 and 0.001, respectively.
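For a rough sense of the boundary detector's size, its parameter count can be tallied from the stated layer widths. This sketch assumes a 40-dimensional FBANK input, unidirectional LSTMs, a 512→128→128 FC layout, and one bias vector per LSTM gate (a common simplification; some frameworks such as PyTorch keep two) — none of these details are confirmed by the source, so the totals are estimates only.

```python
def lstm_params(input_dim, hidden):
    # 4 gates, each with input-to-hidden and hidden-to-hidden weights
    # plus one bias vector (simplified; PyTorch's nn.LSTM uses two biases).
    return 4 * ((input_dim + hidden) * hidden + hidden)

def fc_params(in_dim, out_dim):
    # Dense layer: weight matrix plus bias vector.
    return in_dim * out_dim + out_dim

total = (lstm_params(40, 512)     # LSTM layer 1 on assumed 40-dim FBANK input
         + lstm_params(512, 512)  # LSTM layer 2
         + fc_params(512, 128)    # FC layer 1, ReLU
         + fc_params(128, 128))   # FC layer 2, ReLU
# Under these assumptions the boundary detector has ~3.3M parameters.
```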