Simpson’s Bias in NLP Training

Authors: Fei Yuan, Longtu Zhang, Huang Bojun, Yaobo Liang (pp. 14276–14283)

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show, both theoretically and experimentally, that some popular designs of the sample-level loss G may be inconsistent with the true population-level metric F of the task, so that models trained to optimize the former can be substantially sub-optimal with respect to the latter, a phenomenon we call Simpson's bias due to its deep connections with the classic paradox known as Simpson's reversal paradox in statistics and social sciences. In this paper, we systematically investigate the above assumption in several NLP tasks. We then experimentally examine and verify the practical impacts of Simpson's bias on the training of state-of-the-art models in three different NLP tasks: Paraphrase Similarity Matching (with the DSC metric), Named Entity Recognition (with the Macro-F1 metric), and Machine Translation (with the BLEU metric). (The gap between sample-level and population-level objectives is illustrated in the first sketch below the table.)
Researcher Affiliation | Collaboration | Fei Yuan (University of Electronic Science and Technology of China), Longtu Zhang (Rakuten Institute of Technology, Rakuten, Inc.), Huang Bojun (Rakuten Institute of Technology, Rakuten, Inc.), Yaobo Liang (Microsoft Research Asia)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statements or links indicating the availability of open-source code for the methodology described.
Open Datasets | Yes | For PSM, we use two standard datasets: the Microsoft Research Paraphrase Corpus (MRPC) (Dolan and Brockett 2005) and Quora Question Pairs (QQP) (Wang et al. 2018). For NER, we fine-tune the BERT base multilingual cased model with different loss functions (CE / Dice) on the GermEval 2014 dataset (Benikova, Biemann, and Reznicek 2014). For MT, we train a transformer model (Vaswani et al. 2017) on the IWSLT 2016 dataset using the default settings in the original paper. (A dataset-loading sketch follows the table.)
Dataset Splits | No | The paper mentions general training parameters but does not provide specific train/validation/test dataset splits, sample counts, or citations to predefined splits within the provided text. While an appendix is mentioned, it is not available for analysis.
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions using a 'pre-trained BERT-base-uncased model' and refers to settings from Wolf et al. (2019) (Hugging Face Transformers), but it does not specify version numbers for these software components or other libraries.
Experiment Setup | Yes | The officially recommended parameter settings (Wolf et al. 2019) are leveraged, including max sequence length=128, epoch number=3, train batch size=32, learning rate=2e-5, and γ=1. In the experiment, we use the same setting as Wolf et al. (2019), including max sequence length=128, epoch=3, lr=5e-5, batch size=32, γ=1, and the Dice loss is 1 − F_F1. For MT, we train a transformer model (Vaswani et al. 2017) on the IWSLT 2016 dataset using the default settings in the original paper, except that we hold the learning rate constant at 0.0001 and set the batch size to 10K tokens after padding. (A fine-tuning configuration sketch follows the table.)
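
The following minimal Python sketch (not from the paper; the per-sample counts are hypothetical) illustrates the Simpson's-bias phenomenon summarized under Research Type: averaging per-sample Dice (DSC) scores can rank two models in the opposite order of the corpus-level DSC computed over the pooled counts.

```python
# Illustrative sketch of Simpson's bias with the DSC metric.
# The confusion counts below are hypothetical, chosen only to produce a reversal.

def dsc(tp, fp, fn):
    """Dice similarity coefficient: 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)

# Per-sample (tp, fp, fn) counts for two hypothetical models on the same two samples.
model_a = [(1, 0, 0), (1, 9, 9)]   # perfect on a tiny sample, poor on a large one
model_b = [(0, 1, 1), (8, 2, 2)]   # mediocre on the tiny sample, good on the large one

for name, samples in [("A", model_a), ("B", model_b)]:
    sample_avg = sum(dsc(*s) for s in samples) / len(samples)   # sample-level objective
    pooled = dsc(*map(sum, zip(*samples)))                      # population-level metric
    print(f"model {name}: mean per-sample DSC = {sample_avg:.3f}, corpus DSC = {pooled:.3f}")
```

Running the sketch, model A wins on the mean per-sample DSC (0.550 vs. 0.400) while model B wins on the corpus-level DSC (0.727 vs. 0.182), which is exactly the kind of reversal the paper attributes to training on a sample-level surrogate of a population-level metric.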
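For the public PSM datasets named under Open Datasets, a minimal loading sketch is shown below. It assumes the Hugging Face `datasets` library, which the paper does not specify; GermEval 2014 and IWSLT 2016 are typically obtained from their official distribution pages instead.

```python
# Hypothetical loading sketch; the paper does not state how the data were obtained.
from datasets import load_dataset

mrpc = load_dataset("glue", "mrpc")  # Microsoft Research Paraphrase Corpus
qqp = load_dataset("glue", "qqp")    # Quora Question Pairs

print(mrpc["train"].num_rows, qqp["train"].num_rows)
```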
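The PSM hyperparameters quoted under Experiment Setup (max sequence length 128, 3 epochs, batch size 32, learning rate 2e-5) can be expressed, for orientation only, with the Hugging Face Transformers Trainer API. This is an assumed reconstruction, not the authors' code; the Dice/CE loss variants and the γ parameter are not reproduced here.

```python
# Assumed reconstruction of the reported PSM fine-tuning settings; not the authors' code.
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    # max sequence length = 128, as in the reported setup (MRPC-style sentence pairs)
    return tokenizer(batch["sentence1"], batch["sentence2"],
                     truncation=True, max_length=128, padding="max_length")

args = TrainingArguments(
    output_dir="psm-bert",            # hypothetical output path
    num_train_epochs=3,               # epoch number = 3
    per_device_train_batch_size=32,   # train batch size = 32
    learning_rate=2e-5,               # learning rate = 2e-5
)

# trainer = Trainer(model=model, args=args,
#                   train_dataset=mrpc["train"].map(tokenize, batched=True))
# trainer.train()
```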