Probing Natural Language Inference Models through Semantic Fragments

Authors: Kyle Richardson, Hai Hu, Lawrence Moss, Ashish Sabharwal

AAAI 2020, pp. 8713-8721

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments, using a library of 8 such semantic fragments, reveal two remarkable findings: (a) State-of-the-art models, including BERT, that are pre-trained on existing NLI benchmark datasets perform poorly on these new fragments, even though the phenomena probed here are central to the NLI task; (b) On the other hand, with only a few minutes of additional fine-tuning, with a carefully selected learning rate and a novel variation of inoculation, a BERT-based model can master all of these logic and monotonicity fragments while retaining its performance on established NLI benchmarks. (See the inoculation fine-tuning sketch after this table.)
Researcher Affiliation | Collaboration | Allen Institute for AI, Seattle, WA, USA; Indiana University, Bloomington, IN, USA. {kyler, ashishs}@allenai.org, {huhai, lmoss}@indiana.edu
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. Figure 3 shows rule templates and labeled examples, which are descriptive rather than executable pseudocode.
Open Source Code | No | The paper mentions using a third-party library: 'We use the BERT-base uncased model in all experiments, as implemented in Hugging Face: https://github.com/huggingface/pytorch-pretrained-BERT.' However, it does not provide access to the authors' own source code for the methodology described in the paper.
Open Datasets | Yes | Progress in empirical NLI has accelerated due to the introduction of new large-scale NLI datasets, such as the Stanford Natural Language Inference (SNLI) dataset (Bowman et al. 2015) and MultiNLI (MNLI) (Williams, Nangia, and Bowman 2018).
Dataset Splits | Yes | For each fragment, we uniformly generated 3,000 training examples and reserved 1,000 examples for testing. ... We also reserve 1,000 for development. (See the split sketch after this table.)
Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments.
Software Dependencies | No | The paper mentions using 'Hugging Face: https://github.com/huggingface/pytorch-pretrained-BERT' for BERT implementation, implying PyTorch, but does not provide specific version numbers for any software libraries or dependencies (e.g., PyTorch version, Python version, Hugging Face Transformers version).
Experiment Setup | No | The paper mentions hyperparameter searches ('We found all models to be sensitive to learning rate, and performed comprehensive hyper-parameters searches to consider different learning rates, # iterations and (for BERT) random seeds') but does not provide the specific hyperparameter values (e.g., exact learning rates, batch sizes, number of epochs) or other system-level training settings used in the experiments. (See the grid sketch after this table.)
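
Since the paper's headline result hinges on brief fine-tuning with a variation of inoculation (fine-tuning on small amounts of challenge data while monitoring the original benchmark) but releases no code, here is a minimal sketch of that recipe. It assumes the modern Hugging Face `transformers` API rather than the older `pytorch-pretrained-BERT` package the paper cites, and the learning-rate grid, epoch count, and model-selection rule are all assumptions; the paper only says the learning rate was carefully selected.

```python
# Minimal sketch of inoculation-style fine-tuning, NOT the authors' code.
# In the paper the starting point is a BERT model already trained on NLI
# benchmarks; loading raw `bert-base-uncased` here is a simplification.
import torch
from torch.optim import AdamW
from transformers import BertForSequenceClassification, BertTokenizerFast

MODEL_NAME = "bert-base-uncased"        # matches the paper's base model
LEARNING_RATES = [5e-5, 2e-5, 1e-5]     # assumed grid; not reported in the paper

def encode(tokenizer, examples):
    """examples: list of (premise, hypothesis, label_id) triples."""
    batch = tokenizer([p for p, _, _ in examples],
                      [h for _, h, _ in examples],
                      padding=True, truncation=True, return_tensors="pt")
    batch["labels"] = torch.tensor([y for _, _, y in examples])
    return batch

def accuracy(model, batch):
    model.eval()
    with torch.no_grad():
        logits = model(**{k: v for k, v in batch.items() if k != "labels"}).logits
    return (logits.argmax(-1) == batch["labels"]).float().mean().item()

def inoculate(fragment_train, fragment_dev, benchmark_dev, epochs=3):
    """Fine-tune briefly on fragment data at several learning rates and keep
    the run that does best on BOTH the fragment and the original benchmark,
    which is the core idea behind inoculation."""
    tokenizer = BertTokenizerFast.from_pretrained(MODEL_NAME)
    train_batch = encode(tokenizer, fragment_train)
    best = None
    for lr in LEARNING_RATES:
        model = BertForSequenceClassification.from_pretrained(MODEL_NAME,
                                                              num_labels=3)
        optimizer = AdamW(model.parameters(), lr=lr)
        model.train()
        for _ in range(epochs):            # full-batch updates, for brevity
            optimizer.zero_grad()
            model(**train_batch).loss.backward()
            optimizer.step()
        # Balance fragment mastery against retained benchmark performance.
        score = min(accuracy(model, encode(tokenizer, fragment_dev)),
                    accuracy(model, encode(tokenizer, benchmark_dev)))
        if best is None or score > best[0]:
            best = (score, lr, model)
    return best  # (balanced dev score, chosen learning rate, fine-tuned model)
```

Taking the minimum of the two dev accuracies is one plausible way to operationalize "master the fragment while retaining benchmark performance"; the paper's actual selection criterion is not specified.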
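The split sizes in the Dataset Splits row are concrete (3,000 train, 1,000 dev, and 1,000 test per fragment), though the generation procedure is not released. A minimal sketch of carving those splits from a pool of generated examples, assuming a simple shuffle-and-slice scheme:

```python
# Sketch of the per-fragment splits reported above. The pool of generated
# (premise, hypothesis, label) triples is assumed to come from elsewhere.
import random

TRAIN_SIZE, DEV_SIZE, TEST_SIZE = 3000, 1000, 1000

def split_fragment(examples, seed=0):
    """examples: at least 5,000 generated triples for one semantic fragment."""
    assert len(examples) >= TRAIN_SIZE + DEV_SIZE + TEST_SIZE
    rng = random.Random(seed)
    rng.shuffle(examples)
    train = examples[:TRAIN_SIZE]
    dev = examples[TRAIN_SIZE:TRAIN_SIZE + DEV_SIZE]
    test = examples[TRAIN_SIZE + DEV_SIZE:TRAIN_SIZE + DEV_SIZE + TEST_SIZE]
    return train, dev, test
```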
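The Experiment Setup row notes a search over learning rates, iteration counts, and (for BERT) random seeds without giving values. A sketch of what such a grid might look like; every concrete value below is an assumption, not taken from the paper:

```python
# Hypothetical hyperparameter grid over the three dimensions the quote
# names; none of these values are reported in the paper.
import itertools

LEARNING_RATES = [5e-5, 3e-5, 2e-5, 1e-5]   # assumed
EPOCH_COUNTS = [2, 3, 4]                    # assumed ("# iterations")
SEEDS = [13, 42, 87]                        # assumed (BERT-only dimension)

def search_grid():
    """Yield one training configuration per grid point."""
    for lr, epochs, seed in itertools.product(LEARNING_RATES,
                                              EPOCH_COUNTS, SEEDS):
        yield {"learning_rate": lr, "num_epochs": epochs, "seed": seed}
```

Each configuration would be used to fine-tune and score a model, keeping the best run under whatever selection criterion is chosen (e.g., dev accuracy).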