Natural Language Inference in Context – Investigating Contextual Reasoning over Long Texts
Authors: Hanmeng Liu, Leyang Cui, Jian Liu, Yue Zhang
AAAI 2021, pp. 13388-13396
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results show that state-of-the-art language models perform by far worse than educated humans. Our dataset can also serve as a testing-set for downstream tasks like checking the factual correctness of summaries. We evaluate the state-of-the-art NLI models to establish baseline performances for ConTRoL. Experimental results demonstrate a significant gap between machine and human ceiling performance. |
| Researcher Affiliation | Academia | Zhejiang University; Fudan University; Westlake University |
| Pseudocode | No | The paper describes model structures and implementation details but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | Our dataset and results are released at https://github.com/csitfun/ConTRoL-dataset. (The link points to the dataset and results, not explicitly the source code for the methodology.) |
| Open Datasets | Yes | Our dataset and results are released at https://github.com/csitfun/ConTRoL-dataset. |
| Dataset Splits | Yes | We randomly split the dataset into training, development, and test set with the ratio of 8:1:1. (A split sketch follows the table.) |
| Hardware Specification | No | The paper mentions Transformer-based models and their token limits but does not provide specific hardware details like GPU/CPU models or processor types used for experiments. |
| Software Dependencies | No | The paper mentions models like BERT, RoBERTa, Longformer, and BART but does not provide specific version numbers for software dependencies or libraries used in the implementation. (A model-loading sketch follows the table.) |
| Experiment Setup | Yes | All models are trained for 10 epochs. We find hyper-parameters using grid search: batch size {8, 16, 32}, learning rate {1e-5, 2e-5, 3e-5, 4e-5, 5e-5}, and gradient accumulation steps {1, 2, 4}. We set the max length to 512 tokens for all models except Longformer, for which the max length is 3,000 tokens. (A sketch of this grid enumeration follows the table.) |
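
For reference, the 8:1:1 split quoted in the Dataset Splits row could be reproduced along the following lines. This is a minimal sketch: the file name, record layout, and random seed are assumptions, not details taken from the paper or the repository.

```python
import json
import random

# Hypothetical file and field layout; the actual release at
# https://github.com/csitfun/ConTRoL-dataset may be organized differently.
with open("control_dataset.json") as f:
    examples = json.load(f)  # assumed: a list of {"premise", "hypothesis", "label"} records

random.seed(42)  # the paper does not report a seed; fixed here only for repeatability
random.shuffle(examples)

# 8:1:1 split into training, development, and test sets, as described in the paper.
n = len(examples)
train = examples[: int(0.8 * n)]
dev = examples[int(0.8 * n): int(0.9 * n)]
test = examples[int(0.9 * n):]

print(len(train), len(dev), len(test))  # roughly 8:1:1
```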
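The baselines named in the Software Dependencies row are standard pretrained Transformers fine-tuned for three-way NLI. The sketch below shows how one such baseline could be loaded and a premise-hypothesis pair encoded under the 512-token limit, assuming the Hugging Face transformers API; the checkpoint name, label ordering, and example texts are assumptions, since the paper does not specify them.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# "roberta-large" stands in for the BERT / RoBERTa / Longformer / BART baselines;
# the paper does not name the exact checkpoints.
MODEL_NAME = "roberta-large"
MAX_LENGTH = 512  # 3,000 for Longformer, per the experiment setup above

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)

# Illustrative texts only; ConTRoL premises are long, multi-sentence passages.
premise = "The committee met on Monday, reviewed the proposal, and approved the new budget."
hypothesis = "The budget was approved."

# Premise and hypothesis are packed into one sequence and truncated to the model's limit.
inputs = tokenizer(premise, hypothesis,
                   truncation=True, max_length=MAX_LENGTH, return_tensors="pt")
logits = model(**inputs).logits  # shape (1, 3); label order is an assumption
print(logits.argmax(dim=-1))
```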
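The grid search in the Experiment Setup row enumerates 3 x 5 x 3 = 45 hyper-parameter configurations per model. A minimal sketch of that enumeration is given below; train_and_evaluate is a hypothetical placeholder for a full fine-tuning run, not a function from the paper's code.

```python
from itertools import product

# Grid reported in the paper.
BATCH_SIZES = [8, 16, 32]
LEARNING_RATES = [1e-5, 2e-5, 3e-5, 4e-5, 5e-5]
GRAD_ACCUM_STEPS = [1, 2, 4]
EPOCHS = 10  # "All models are trained for 10 epochs."

def train_and_evaluate(batch_size, lr, accum_steps, epochs):
    """Hypothetical stand-in for one fine-tuning run; should return dev-set accuracy."""
    return 0.0  # replace with an actual training and evaluation loop

best = None
for batch_size, lr, accum_steps in product(BATCH_SIZES, LEARNING_RATES, GRAD_ACCUM_STEPS):
    dev_acc = train_and_evaluate(batch_size, lr, accum_steps, EPOCHS)
    if best is None or dev_acc > best[0]:
        best = (dev_acc, batch_size, lr, accum_steps)

print("best config (dev acc, batch size, lr, grad accum):", best)
```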