Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces

Authors: Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass

NeurIPS 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 5 (Experiments): "In this section, we empirically demonstrate the effectiveness of our unsupervised cross-modal alignment approach on spoken word classification and translation introduced in Section 4."
Researcher Affiliation | Academia | Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA. {andyyuan,ckbjimmy,st9,glass}@mit.edu
Pseudocode | No | Insufficient information. The paper describes its methods in prose and equations but provides no structured pseudocode or algorithm blocks.
Open Source Code | No | Insufficient information. The paper contains no explicit statement about releasing source code and no link to a code repository.
Open Datasets | Yes | "For our experiments, we used English and French LibriSpeech [46, 47], and English and German Spoken Wikipedia Corpora (SWC) [48]."
Dataset Splits | No | Insufficient information. The paper reports 'Train' and 'Test' sets with their sizes but gives no details about a validation set or the specific split proportions.
Hardware Specification | No | Insufficient information. The paper does not specify the hardware (e.g., CPU or GPU models) used to run the experiments.
Software Dependencies | No | Insufficient information. The paper mentions using the fastText implementation but does not give its version or any other versioned software dependencies.
Experiment Setup | Yes | The speech embeddings were trained using Speech2Vec with skip-grams, with the window size k set to three. The encoder is a single-layer bidirectional LSTM and the decoder is a single-layer unidirectional LSTM; the model was trained by stochastic gradient descent (SGD) with a fixed learning rate of 10^-3. Both speech and text embeddings are 50-dimensional. For the adversarial training, the discriminator was a two-layer neural network of size 512 with ReLU activations, and both the discriminator and the mapping W were trained by SGD with a fixed learning rate of 10^-3. (This setup is sketched in code after the table.)
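
To make the Experiment Setup row concrete, here is a minimal PyTorch sketch of the Speech2Vec skip-gram encoder-decoder it describes. Only the layer types, the 50-d embedding dimension, the optimizer, and the learning rate come from the paper; the class name, the 13-dimensional MFCC input features, and the choice to feed the word embedding at every decoder step are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Speech2VecSkipgram(nn.Module):
    """Sketch of Speech2Vec with skip-grams: a single-layer bidirectional LSTM
    encodes a variable-length spoken word (a sequence of acoustic frames) into
    a fixed 50-d embedding; a single-layer unidirectional LSTM decoder then
    reconstructs the frames of neighboring words within a window of k = 3."""

    def __init__(self, feat_dim=13, emb_dim=50):  # 13-d MFCCs are an assumption
        super().__init__()
        # Bidirectional encoder: emb_dim // 2 units per direction -> 50-d total.
        self.encoder = nn.LSTM(feat_dim, emb_dim // 2,
                               bidirectional=True, batch_first=True)
        # Simplification: decoder hidden size equals feat_dim, so its outputs
        # are read directly as frame predictions (a real system might project).
        self.decoder = nn.LSTM(emb_dim, feat_dim, batch_first=True)

    def forward(self, frames, target_len):
        # frames: (batch, time, feat_dim); h: (2, batch, emb_dim // 2)
        _, (h, _) = self.encoder(frames)
        emb = torch.cat([h[0], h[1]], dim=-1)          # (batch, emb_dim)
        # Feed the embedding at every step to decode a neighboring word.
        dec_in = emb.unsqueeze(1).repeat(1, target_len, 1)
        recon, _ = self.decoder(dec_in)                # (batch, target_len, feat_dim)
        return emb, recon

model = Speech2VecSkipgram()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)  # fixed LR from the paper
```

The adversarial alignment can be sketched in the same spirit, assuming a MUSE-style GAN objective in which the discriminator learns to distinguish mapped speech embeddings from text embeddings while the linear map W is trained to fool it. The function name train_step, the batching, the label convention, and reading "two-layer" as two hidden layers of size 512 are assumptions, not details confirmed by the excerpt.

```python
import torch
import torch.nn as nn

emb_dim, hidden, lr = 50, 512, 1e-3              # values reported in the paper

W = nn.Linear(emb_dim, emb_dim, bias=False)      # speech -> text space mapping
disc = nn.Sequential(                            # "two-layer NN of size 512, ReLU"
    nn.Linear(emb_dim, hidden), nn.ReLU(),
    nn.Linear(hidden, hidden), nn.ReLU(),
    nn.Linear(hidden, 1), nn.Sigmoid(),
)
opt_w = torch.optim.SGD(W.parameters(), lr=lr)
opt_d = torch.optim.SGD(disc.parameters(), lr=lr)
bce = nn.BCELoss()

def train_step(speech_emb, text_emb):
    """One adversarial round: the discriminator learns to separate mapped
    speech embeddings (label 0) from text embeddings (label 1); W is then
    updated to make mapped speech embeddings look like text embeddings."""
    # Discriminator update (W frozen via detach).
    mapped = W(speech_emb).detach()
    d_loss = (bce(disc(mapped), torch.zeros(len(mapped), 1))
              + bce(disc(text_emb), torch.ones(len(text_emb), 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Mapping update: W tries to fool the discriminator.
    w_loss = bce(disc(W(speech_emb)), torch.ones(len(speech_emb), 1))
    opt_w.zero_grad(); w_loss.backward(); opt_w.step()
    return d_loss.item(), w_loss.item()
```

The refinement and retrieval steps that typically follow in MUSE-style pipelines (e.g., Procrustes refinement and nearest-neighbor search in the shared space) are omitted here, since the excerpt does not specify them.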