Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces
Authors: Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass
NeurIPS 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 5 (Experiments): "In this section, we empirically demonstrate the effectiveness of our unsupervised cross-modal alignment approach on spoken word classification and translation introduced in Section 4." |
| Researcher Affiliation | Academia | Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Cambridge, MA 02139, USA {andyyuan,ckbjimmy,st9,glass}@mit.edu |
| Pseudocode | No | Insufficient information. The paper describes the methods in text and equations but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | No | Insufficient information. The paper does not contain any explicit statements about releasing source code or provide a link to a code repository. |
| Open Datasets | Yes | For our experiments, we used English and French LibriSpeech [46, 47], and English and German Spoken Wikipedia Corpora (SWC) [48]. |
| Dataset Splits | No | Insufficient information. The paper explicitly mentions 'Train' and 'Test' sets with their sizes, but does not describe a validation set or how it was split. |
| Hardware Specification | No | Insufficient information. The paper does not specify the hardware (e.g., CPU, GPU models) used for running the experiments. |
| Software Dependencies | No | Insufficient information. The paper mentions using 'fastText implementation' but does not specify its version number or any other software dependencies with version details. |
| Experiment Setup | Yes | The speech embeddings were trained using Speech2Vec with Skip-grams by setting the window size k to three. The Encoder is a single-layer bidirectional LSTM, and the Decoder is a single-layer unidirectional LSTM. The model was trained by stochastic gradient descent (SGD) with a fixed learning rate of 10^-3. The dimension of both speech and text embeddings is 50. For the adversarial training, the discriminator was a two-layer neural network of size 512 with ReLU as the activation function. Both the discriminator and W were trained by SGD with a fixed learning rate of 10^-3. |
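The setup row above fully specifies the adversarial alignment components: a linear map W between 50-dimensional speech and text embedding spaces, a two-layer discriminator of hidden size 512 with ReLU activations, and SGD with a fixed learning rate of 10^-3 for both. A minimal NumPy sketch of that training loop is below; the random toy embeddings, batch size, and step count are illustrative assumptions, not values from the paper (which trains on Speech2Vec and fastText embeddings).

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 50      # embedding dimension (from the paper)
HIDDEN = 512  # discriminator hidden size (from the paper)
LR = 1e-3     # fixed SGD learning rate (from the paper)

# Toy stand-ins for the real Speech2Vec / fastText embeddings (assumption).
speech = rng.normal(size=(1000, DIM))
text = rng.normal(size=(1000, DIM))

# Linear alignment map W (speech space -> text space).
W = np.eye(DIM)

# Two-layer discriminator: DIM -> HIDDEN -> 1, ReLU hidden activation.
W1 = rng.normal(scale=0.1, size=(DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(scale=0.1, size=(HIDDEN, 1))
b2 = np.zeros(1)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def disc_forward(x):
    """Return hidden activations and P(input is a real text embedding)."""
    h = np.maximum(0.0, x @ W1 + b1)  # ReLU hidden layer
    return h, sigmoid(h @ W2 + b2)


def disc_grad_step(x, target):
    """One SGD step on binary cross-entropy for the discriminator."""
    global W1, b1, W2, b2
    h, p = disc_forward(x)
    dlogit = (p - target) / len(x)  # dBCE/dlogit for a sigmoid output
    gW2, gb2 = h.T @ dlogit, dlogit.sum(axis=0)
    dh = (dlogit @ W2.T) * (h > 0)  # backprop through ReLU
    gW1, gb1 = x.T @ dh, dh.sum(axis=0)
    W2 -= LR * gW2; b2 -= LR * gb2
    W1 -= LR * gW1; b1 -= LR * gb1


def mapping_grad_step(x):
    """One SGD step on W so mapped speech embeddings fool the discriminator."""
    global W
    mapped = x @ W
    h, p = disc_forward(mapped)
    dlogit = (p - 1.0) / len(x)  # generator loss: -log D(Wx)
    dmapped = ((dlogit @ W2.T) * (h > 0)) @ W1.T
    W -= LR * (x.T @ dmapped)


for step in range(100):
    idx = rng.integers(0, 1000, size=32)
    disc_grad_step(text[idx], np.ones((32, 1)))          # real: text embeddings
    disc_grad_step(speech[idx] @ W, np.zeros((32, 1)))   # fake: mapped speech
    mapping_grad_step(speech[idx])
```

The alternating updates (discriminator, then mapping) follow the standard adversarial-training recipe the paper builds on; the paper additionally refines W with a Procrustes step, which is omitted here.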