Constrained Decoding for Cross-lingual Label Projection

Authors: Duong Minh Le, Yang Chen, Alan Ritter, Wei Xu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate on two cross-lingual transfer tasks, namely Named Entity Recognition and Event Argument Extraction, spanning 20 languages. The results demonstrate that our approach outperforms the state-of-the-art marker-based method by a large margin and also shows better performance than other label projection methods that rely on external word alignment.
Researcher Affiliation | Academia | Duong Minh Le, Yang Chen, Alan Ritter & Wei Xu, Georgia Institute of Technology; {dminh6, yangc}@gatech.edu, {alan.ritter, wei.xu}@cc.gatech.edu
Pseudocode | Yes | Algorithm 1 Constrained DFS: Searching for top-k best hypotheses [a hedged sketch of this search follows the table]
Open Source Code | Yes | Our code is available at: https://github.com/duonglm38/Codec
Open Datasets | Yes | For Named Entity Recognition (NER), we use English CoNLL03 (Tjong Kim Sang, 2002) as train/dev data and use MasakhaNER2.0 (Adelani et al., 2022), which consists of human-labeled data for 20 African languages, as test data. ... For Event Argument Extraction (EAE), we use ACE-2005 (Doddington et al., 2004), a multilingual dataset that covers English, Chinese, and Arabic. [a hedged loading sketch follows the table]
Dataset Splits | Yes | We evaluate the performance of each setting in translate-dev on a sample of the MasakhaNER2.0 dev set for five languages (i.e., Bambara, Fon, Mossi, Yoruba, and isiZulu).
Hardware Specification | Yes | For all experiments, we use one A40 GPU (48GB).
Software Dependencies | No | The paper mentions fine-tuning mDeBERTa-v3 and mT5-large and using NLLB, but it does not specify software versions for these tools (e.g., the PyTorch version or other library versions).
Experiment Setup | Yes | Setup: We use NLLB (No Language Left Behind) as the translation model (Costa-jussà et al., 2022) in our experiments. We fine-tune mDeBERTa-v3 (276M) to act as a NER tagger following (Chen et al., 2023a), and fine-tune mT5-large (Xue et al., 2021) following the X-Gear framework (Huang et al., 2022) for EAE. For a direct comparison with existing work (Chen et al., 2023b;a; Huang et al., 2022), we report the average F1 scores across five random seeds for NER and three random seeds for EAE. More details are provided in Appendix B. ... In our implementation, for efficiency, we send a batch of partial hypotheses in Line 19. We set the batch size equal to 16 and 12 for the NER and EAE experiments, respectively. For all experiments, we search for 5 hypotheses with the highest probabilities (i.e., k = 5). ... We set δ equal to 1 for translate-train experiments on the MasakhaNER2.0 dataset and set δ equal to 5 for all other experiments. [a hedged configuration sketch follows the table]
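
The Pseudocode row names Algorithm 1, a constrained depth-first search for the top-k best hypotheses. The paper's algorithm is not reproduced in this summary, so the sketch below is a generic top-k constrained DFS under stated assumptions: `expand` and `is_complete` are hypothetical stand-ins for the translation model's constrained next-token distribution and the template-completion check, and scores are token log-probabilities (non-positive, so a partial score can only decrease).

```python
import heapq
from typing import Callable, List, Tuple

Hyp = Tuple[str, ...]

def constrained_dfs(
    root: Hyp,
    expand: Callable[[Hyp], List[Tuple[str, float]]],  # legal continuations with log-probs
    is_complete: Callable[[Hyp], bool],                # does the hypothesis satisfy the constraint?
    k: int = 5,
) -> List[Tuple[float, Hyp]]:
    """Return the k highest-scoring complete hypotheses reachable from `root`."""
    best: List[Tuple[float, Hyp]] = []  # min-heap over (score, hypothesis)

    def dfs(prefix: Hyp, score: float) -> None:
        # Prune: log-probs are <= 0, so a partial score only decreases;
        # once k hypotheses are kept, anything at or below the k-th best dies here.
        if len(best) == k and score <= best[0][0]:
            return
        if is_complete(prefix):
            heapq.heappush(best, (score, prefix))
            if len(best) > k:
                heapq.heappop(best)
            return
        # Expand higher-probability continuations first so pruning triggers early.
        for token, logp in sorted(expand(prefix), key=lambda x: -x[1]):
            dfs(prefix + (token,), score + logp)

    dfs(root, 0.0)
    return sorted(best, reverse=True)

# Toy usage: all length-2 sequences over {a, b} with fixed log-probs.
log_probs = {"a": -0.1, "b": -0.7}
top = constrained_dfs(
    root=(),
    expand=lambda p: list(log_probs.items()),
    is_complete=lambda p: len(p) == 2,
    k=5,
)
print(top)  # (score, hypothesis) pairs, best first
```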
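
A sketch of loading the NER test data named in the Open Datasets row. This assumes the MasakhaNER2.0 release on the Hugging Face Hub under the `masakhane/masakhaner2` ID with per-language configs such as `bam` (Bambara); the paper excerpt names the dataset but not this loading path. ACE-2005 is licensed through the LDC and is not loadable from the Hub.

```python
from datasets import load_dataset

# Assumed Hub ID and language config; MasakhaNER2.0 covers 20 African languages.
masakhaner = load_dataset("masakhane/masakhaner2", "bam")

example = masakhaner["test"][0]
print(example["tokens"])    # whitespace-tokenized sentence
print(example["ner_tags"])  # integer-encoded BIO labels
```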
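
A configuration sketch for the Experiment Setup row, plus a plain (unconstrained) NLLB translation call through Hugging Face `transformers`. The hyperparameter values are quoted from the setup above; the NLLB checkpoint size, language codes, and input sentence are assumptions, since the excerpt does not state which NLLB variant was used.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Hyperparameters quoted in the setup above (δ is defined in the paper).
CONFIG = {
    "k": 5,                                 # number of top hypotheses to search for
    "batch_size": {"ner": 16, "eae": 12},   # partial-hypothesis batch sizes
    "delta": {"masakhaner_translate_train": 1, "default": 5},
}

# Assumed checkpoint: the paper says "NLLB" without naming a size.
checkpoint = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

inputs = tokenizer("The meeting was held in Paris.", return_tensors="pt")
outputs = model.generate(
    **inputs,
    # NLLB starts generation with the target-language token (Yoruba here).
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("yor_Latn"),
    max_new_tokens=64,
)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])
```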