Multilingual Alignment of Contextual Word Representations

Authors: Steven Cao, Nikita Kitaev, Dan Klein

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose procedures for evaluating and strengthening contextual embedding alignment and show that they are useful in analyzing and improving multilingual BERT. In particular, after our proposed alignment procedure, BERT exhibits significantly improved zero-shot performance on XNLI compared to the base model, remarkably matching pseudo-fully-supervised translate-train models for Bulgarian and Greek. Further, to measure the degree of alignment, we introduce a contextual version of word retrieval and show that it correlates well with downstream zero-shot transfer. Using this word retrieval task, we also analyze BERT and find that it exhibits systematic deficiencies, e.g., worse alignment for open-class parts-of-speech and word pairs written in different scripts, that are corrected by the alignment procedure. (A retrieval sketch follows the table.)
Researcher Affiliation | Academia | Steven Cao, Nikita Kitaev & Dan Klein, Computer Science Division, University of California, Berkeley; {stevencao,kitaev,klein}@berkeley.edu
Pseudocode | No | The paper does not include any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statement about releasing open-source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | As our dataset, we use the Europarl corpora for English paired with Bulgarian, German, Greek, Spanish, and French... We use the most recent 1024 sentences as the test set, the previous 1024 sentences as the development set, and the following 250K sentences as the training set. ...we also report numbers for 10K and 50K parallel sentences.
Dataset Splits | Yes | We use the most recent 1024 sentences as the test set, the previous 1024 sentences as the development set, and the following 250K sentences as the training set. (A split sketch follows the table.)
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., CPU or GPU models) used to run the experiments.
Software Dependencies | No | The paper mentions software such as "fast_align", "polyglot", and "spaCy" that were used, but it does not specify version numbers for these or other key software components.
Experiment Setup | Yes | For both alignment and XNLI optimization, we use a learning rate of 5 × 10⁻⁵ with Adam hyperparameters β = (0.9, 0.98), ϵ = 10⁻⁹ and linear learning rate warmup for the first 10% of the training data. For alignment, the model is trained for one epoch, with each batch containing 2 sentence pairs per language. For XNLI, each model is trained for 3 epochs with 32 examples per batch, and 10% dropout is applied to the BERT embeddings. (An optimizer sketch follows the table.)
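
To make the contextual word retrieval evaluation quoted in the Research Type row more concrete, here is a minimal sketch of nearest-neighbor retrieval over contextual embeddings. It assumes cosine similarity and that gold word pairs have already been extracted from parallel sentences (e.g. with fast_align); the function name and the exact retrieval criterion are illustrative assumptions, since the paper releases no code.

```python
import numpy as np

def retrieval_accuracy(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> float:
    """Nearest-neighbor contextual word retrieval (illustrative sketch).

    src_vecs and tgt_vecs are (n, d) arrays of contextual embeddings for the
    two sides of n aligned word occurrences; row i of each array is the gold
    match for row i of the other.
    """
    # Normalize so that a dot product equals cosine similarity.
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sim = src @ tgt.T                # (n, n) similarity matrix
    nearest = sim.argmax(axis=1)     # retrieved target index for each source word
    return float((nearest == np.arange(len(src))).mean())
```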
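
The Europarl split described in the Open Datasets and Dataset Splits rows is simple to express in code. The sketch below assumes the corpus is an in-memory list of sentence pairs ordered from oldest to newest; the function name and the ordering assumption are ours, not the paper's.

```python
def split_europarl(sentence_pairs, train_size=250_000):
    """Split a chronologically ordered list of (English, foreign) sentence pairs."""
    test = sentence_pairs[-1024:]                       # most recent 1024 pairs
    dev = sentence_pairs[-2048:-1024]                   # the 1024 pairs before the test set
    train = sentence_pairs[-(2048 + train_size):-2048]  # the following 250K (or 10K/50K) pairs
    return train, dev, test
```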
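
The hyperparameters quoted in the Experiment Setup row map directly onto a standard optimizer configuration. The sketch below assumes PyTorch and the transformers scheduler as the software stack (the paper does not state its dependencies), and it keeps the learning rate constant after warmup because the paper mentions only warmup, not decay.

```python
import torch
from transformers import get_constant_schedule_with_warmup

def configure_optimization(model, num_training_steps, lr=5e-5, warmup_frac=0.1):
    # Adam with betas = (0.9, 0.98) and eps = 1e-9, as quoted above.
    optimizer = torch.optim.Adam(
        model.parameters(), lr=lr, betas=(0.9, 0.98), eps=1e-9
    )
    # Linear warmup over the first 10% of training steps; the schedule after
    # warmup (constant here) is an assumption, since the paper does not say.
    scheduler = get_constant_schedule_with_warmup(
        optimizer, num_warmup_steps=int(warmup_frac * num_training_steps)
    )
    return optimizer, scheduler
```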