Separate in the Speech Chain: Cross-Modal Conditional Audio-Visual Target Speech Extraction
Authors: Zhaoxi Mu, Xinyu Yang
IJCAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conducted experiments on two audio-visual datasets, namely LRS2-2Mix and VoxCeleb2-2Mix [Li et al., 2022], derived from the LRS2 [Afouras et al., 2022] and VoxCeleb2 [Chung et al., 2018] datasets, respectively. ... We comprehensively compared AVSepChain and existing AV-TSE methods on the LRS2-2Mix and VoxCeleb2-2Mix datasets. The results, presented in Table 1, demonstrate that AVSepChain achieves state-of-the-art performance on both datasets. ... In this section, we performed ablation experiments to validate the effectiveness of each key design proposed in AVSepChain. |
| Researcher Affiliation | Academia | Zhaoxi Mu, Xinyu Yang, Xi'an Jiaotong University, wsmzxxh@stu.xjtu.edu.cn, yxyphd@mail.xjtu.edu.cn |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The paper does not provide any specific repository link, explicit code release statement, or mention of code in supplementary materials for the methodology described. |
| Open Datasets | Yes | We conducted experiments on two audio-visual datasets, namely LRS2-2Mix and VoxCeleb2-2Mix [Li et al., 2022], derived from the LRS2 [Afouras et al., 2022] and VoxCeleb2 [Chung et al., 2018] datasets, respectively. |
| Dataset Splits | No | The paper mentions using a 'validation set' to monitor loss decrease during training ('If the loss does not decrease on the validation set for three consecutive epochs, the learning rate is halved.'), but it does not provide specific dataset split information (exact percentages, sample counts, or detailed splitting methodology) for training, validation, and testing. |
| Hardware Specification | No | No specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running experiments were mentioned in the paper. |
| Software Dependencies | No | The paper mentions using 'AV-HuBERT', 'HuBERT', 'AV-Sepformer', and 'Adam optimization', but it does not provide specific version numbers for these or any other ancillary software components needed to replicate the experiment. |
| Experiment Setup | Yes | For both AV-HuBERT and HuBERT, we employ a 12-layer BASE pre-trained model with a feature dimension of 768, denoted as N_fv = N_fa = 768, and extract the features from the last layer as embeddings. The AV-Separator is built on the same hyperparameters as the AV-Sepformer [Lin et al., 2023], comprising two repetitions of 8 Intra-Transformers, 7 Inter-Transformers, and 1 Cross-Modal Transformer. The values for N_a and K_X are set to 256 and 160, respectively. We employ a logarithmic mel-spectrogram with 80 mel bands, a filter length of 1024, a hop size of 10 ms, a window length of 40 ms, and a Hann window to capture the spectro-temporal features S_pre of the output audio s_pre from the AV-Separator. In other words, N_mel is 80. The intermediate feature dimension N_pro is set to 256. The AV-Synthesizer consists of a cross-modal attention layer, followed by three 1-D convolution layers. The hidden dimensions for the convolution layers are 256, 128, and 160, respectively, with a kernel size of 7. ... When calculating the contrastive semantic matching loss, we set the margin m to 0.5 and λ to 1. The model is trained using Adam optimization, starting with an initial learning rate of 1.5 × 10^-4. If the loss does not decrease on the validation set for three consecutive epochs, the learning rate is halved. The training process is terminated if there is no decrease in the loss for five consecutive epochs. (Hedged configuration sketches follow this table.) |
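
The quoted mel-spectrogram settings map directly onto a standard log-mel front end. The sketch below is a minimal reconstruction using torchaudio, assuming 16 kHz audio (the sampling rate is not stated in the quoted setup), so the 10 ms hop becomes 160 samples and the 40 ms window becomes 640 samples. It is not the authors' implementation.

```python
import torch
import torchaudio

# Assumed sampling rate: 16 kHz (not stated in the quoted setup).
SAMPLE_RATE = 16_000

# 80 mel bands, filter length (n_fft) 1024, 10 ms hop, 40 ms window, Hann window.
mel_spectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE,
    n_fft=1024,
    win_length=int(0.040 * SAMPLE_RATE),  # 40 ms -> 640 samples at 16 kHz
    hop_length=int(0.010 * SAMPLE_RATE),  # 10 ms -> 160 samples at 16 kHz
    n_mels=80,                            # N_mel = 80
    window_fn=torch.hann_window,
)

def log_mel(waveform: torch.Tensor) -> torch.Tensor:
    """Logarithmic mel-spectrogram S_pre of a mono waveform (e.g. the separator output s_pre)."""
    return torch.log(mel_spectrogram(waveform) + 1e-8)  # small epsilon avoids log(0)

# Example: a 2-second dummy waveform yields an (80, frames) feature matrix.
features = log_mel(torch.randn(SAMPLE_RATE * 2))
print(features.shape)  # torch.Size([80, 201])
```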
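
The optimization schedule (Adam at 1.5 × 10^-4, halve the learning rate after three validation epochs without improvement, terminate after five) can be mirrored with a plain counter, as in the sketch below. The model is a trivial stand-in, and `evaluate_on_validation_set` is a hypothetical placeholder for the real validation pass.

```python
import torch
import torch.nn as nn

model = nn.Linear(80, 80)  # trivial stand-in, not the AVSepChain architecture
optimizer = torch.optim.Adam(model.parameters(), lr=1.5e-4)  # initial LR from the paper

def evaluate_on_validation_set(model: nn.Module) -> float:
    """Hypothetical placeholder; the real pass would compute the validation loss."""
    return float(torch.rand(1))

best_loss = float("inf")
stagnant_epochs = 0  # consecutive epochs without a validation-loss decrease

for epoch in range(200):
    # ... run one training epoch here ...
    val_loss = evaluate_on_validation_set(model)
    if val_loss < best_loss:
        best_loss, stagnant_epochs = val_loss, 0
    else:
        stagnant_epochs += 1
        if stagnant_epochs == 3:  # halve the LR after 3 stagnant epochs
            for group in optimizer.param_groups:
                group["lr"] *= 0.5
        if stagnant_epochs >= 5:  # terminate after 5 stagnant epochs
            break
```

`torch.optim.lr_scheduler.ReduceLROnPlateau` would be the more idiomatic choice; the explicit counter is used here only to mirror the paper's wording literally.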