Self-Supervised Audio-and-Text Pre-training with Extremely Low-Resource Parallel Data
Authors: Yu Kang, Tianqiao Liu, Hang Li, Yang Hao, Wenbiao Ding
AAAI 2022, pp. 10875-10883
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our pre-training model achieves comparable performance on multiple downstream speech understanding tasks compared with the model pre-trained on fully parallel data and outperforms all baselines, demonstrating the great potential of our method. |
| Researcher Affiliation | Industry | 1 TAL Education Group, Beijing, China 2 Tencent, Beijing, China {kangyu, liutianqiao, lihang4, haoyang2}@tal.com, darwinding@tencent.com |
| Pseudocode | Yes | Algorithm 1: Self-supervised Multimodal Pre-training with Low-Resource Parallel Data |
| Open Source Code | No | The paper does not provide a specific repository link, an explicit code release statement, or indicate that the code is in supplementary materials for the methodology described. |
| Open Datasets | Yes | We pre-train our model on the LibriSpeech (Panayotov et al. 2015) dataset, which includes both audio recordings and corresponding authorized transcripts of English reading speech. Here, we use the widely-used IEMOCAP dataset (Busso et al. 2008). We adopt the CMU-MOSEI (Zadeh et al. 2018) dataset to evaluate the sentiment analysis task. |
| Dataset Splits | Yes | In our experiments, we sample the low-resource parallel corpus from the train-clean-100 subset; to build the non-parallel corpus, we first combine the remaining two subsets, then split the pool in half and take the text from one half and the audio from the other half. We follow the settings of (Xu et al. 2019) for consistent comparison with previous works, which perform 5-fold cross-validation over sessions. (A sketch of this split construction appears after the table.) |
| Hardware Specification | Yes | We pre-train our model using 4 32G-V100 GPUs with a batch size of 8 for 500,000 steps, and the whole pre-training process takes roughly 72 hours. |
| Software Dependencies | No | The paper mentions using Adam as an optimizer, but does not specify any software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow versions) that would be needed for replication. |
| Experiment Setup | Yes | For the dual Transformer, each encoder in both the unimodal and cross-modal encoders has 3 layers, 12 multi-head attention heads, and a hidden size of 768. We take Adam (Kingma and Ba 2015) as our optimizer with an initial learning rate of 2e-5 and a linear-decayed learning rate schedule with warm-up (Devlin et al. 2019). We pre-train our model using 4 32G-V100 GPUs with a batch size of 8 for 500,000 steps. For the audio modality, the corrupt function is slightly different: we first split the audio features into segments of Snum successive frames each, where Snum is uniformly sampled from 20 to 50. We then randomly select 15% of these segments and, for each, mask it entirely to zero 80% of the time, replace it with Snum other randomly selected frames within the audio 10% of the time, and keep it unchanged in the remaining cases. We increase the probability of corruption in C from 15% to 30%. To mitigate this discrepancy, we decrease the probability of replacing the selected time-steps with masked ones from 80% to 60% during CDAE pre-training. (Sketches of the audio corruption and the optimizer setup appear after this table.) |
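
The "Dataset Splits" row above describes how the low-resource parallel corpus and the non-parallel corpus are derived from LibriSpeech subsets. Below is a minimal sketch of that construction; the utterance-list format, the hour-budget sampling, and the `build_pretraining_corpora` helper are illustrative assumptions, not the authors' released code.

```python
import random

def build_pretraining_corpora(train_clean_100, train_clean_360, train_other_500,
                              parallel_hours=10, seed=0):
    """Sketch of the corpus construction described in the paper:
    sample a small parallel corpus (audio + transcript) from train-clean-100,
    then pool the remaining subsets, split the pool in half, and keep only
    text from one half and only audio from the other half.
    Each subset is assumed to be a list of (audio_path, transcript, duration_sec) tuples.
    """
    rng = random.Random(seed)

    # Low-resource parallel corpus: sample utterances until the hour budget is met.
    shuffled = train_clean_100[:]
    rng.shuffle(shuffled)
    parallel, total_sec = [], 0.0
    for audio, text, dur in shuffled:
        if total_sec >= parallel_hours * 3600:
            break
        parallel.append((audio, text))
        total_sec += dur

    # Non-parallel corpus: combine the other two subsets, split in half,
    # take text from one half and audio from the other half.
    pool = train_clean_360 + train_other_500
    rng.shuffle(pool)
    half = len(pool) // 2
    text_only = [text for _, text, _ in pool[:half]]
    audio_only = [audio for audio, _, _ in pool[half:]]

    return parallel, text_only, audio_only
```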
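The audio-side corruption in the "Experiment Setup" row (segments of Snum consecutive frames, 15% of segments selected, then 80/10/10 mask/replace/keep) can be sketched as follows. The frame-feature shape and the NumPy-based implementation are assumptions for illustration; the paper does not release this code.

```python
import numpy as np

def corrupt_audio(features, select_prob=0.15, mask_prob=0.8, replace_prob=0.1,
                  seg_min=20, seg_max=50, rng=None):
    """Sketch of the segment-level audio corruption described in the paper.

    features: (T, D) array of frame-level acoustic features.
    The sequence is cut into segments of Snum consecutive frames, with Snum
    drawn uniformly from [seg_min, seg_max]. 15% of the segments are selected;
    each selected segment is zero-masked 80% of the time, replaced with Snum
    randomly chosen frames from the same utterance 10% of the time, and kept
    unchanged otherwise.
    """
    rng = rng or np.random.default_rng()
    corrupted = features.copy()
    T = features.shape[0]

    # Cut the sequence into variable-length segments of Snum frames each.
    segments, start = [], 0
    while start < T:
        snum = int(rng.integers(seg_min, seg_max + 1))
        segments.append((start, min(start + snum, T)))
        start += snum

    for begin, end in segments:
        if rng.random() >= select_prob:
            continue  # segment not selected for corruption
        r = rng.random()
        if r < mask_prob:
            corrupted[begin:end] = 0.0                  # mask the whole segment to zero
        elif r < mask_prob + replace_prob:
            idx = rng.integers(0, T, size=end - begin)  # random frames from the same audio
            corrupted[begin:end] = features[idx]
        # else: keep the segment unchanged
    return corrupted
```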
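The optimization settings reported above (Adam, initial learning rate 2e-5, linear decay with warm-up, 500,000 steps) could be wired up roughly as below. The 10% warm-up ratio and the PyTorch/`transformers` scheduler are assumptions; the paper only states that a linear-decayed schedule with warm-up is used.

```python
import torch
from transformers import get_linear_schedule_with_warmup

def build_optimizer(model, total_steps=500_000, warmup_ratio=0.1):
    # Adam with the initial learning rate reported in the paper (2e-5).
    optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
    # Linear warm-up followed by linear decay; the warm-up ratio is an assumption.
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(total_steps * warmup_ratio),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler
```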