Adjective Scale Probe: Can Language Models Encode Formal Semantics Information?
Authors: Wei Liu, Ming Xiang, Nai Ding
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we propose a diagnostic dataset to investigate how well language models understand the degree semantics of adjectives. In the dataset, referred to as the Adjective Scale Probe (ASP), we semi-automatically generate 8 tests of Natural Language Inference (NLI) questions to test 8 key capabilities of adjective interpretation. We apply the ASP dataset to evaluate the performance of 3 language models, i.e., BERT, DeBERTa, and T0. It is found that language models perform below the majority baseline for most tests of the ASP, even when the models have been fine-tuned to achieve high performance on the large-scale MNLI dataset. But after we fine-tune the pre-trained models on a subset of the ASP, DeBERTa can achieve high performance on the untrained adjectives and untrained tests, suggesting that DeBERTa may have captured degree semantic information of adjectives through pre-training, but it needs specific training data to learn how to apply such information to the current tasks. (An illustrative evaluation sketch follows this table.) |
| Researcher Affiliation | Academia | 1. College of Biomedical Engineering and Instrument Sciences, Zhejiang University; 2. Department of Linguistics, The University of Chicago |
| Pseudocode | No | No section or figure labeled 'Pseudocode' or 'Algorithm' is present in the paper, nor are there structured algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper describes the creation of the Adjective Scale Probe (ASP) dataset but does not provide concrete access information (e.g., link, DOI, or formal citation with authors/year) for its public availability. |
| Dataset Splits | Yes | For the entailment inference task, we split the adjective vocabulary into training/testing sets before data generation. Each time, we used 50% of the adjectives to construct the training set and left the remaining half of the adjectives for testing. For the degree estimation task, each time, we used 3 physical dimensions for training, e.g., length, mass, and price, and the remaining dimension, e.g., temperature, was used for testing. (See the split sketch after this table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions models like BERT and DeBERTa but does not provide specific version numbers for these models or for any underlying software libraries or dependencies used in the experiments. |
| Experiment Setup | No | The paper states that 'The fine-tuning parameters on the ASP were shown in Appendix Table 2,' indicating that specific experimental setup details like hyperparameters are not included in the main text. |
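
**Evaluation sketch.** A minimal, hypothetical example of how an MNLI-fine-tuned model of the kind evaluated in the paper (e.g., DeBERTa) could be probed on an ASP-style adjective-entailment pair. The checkpoint name, the sentence pair, and the label mapping are illustrative assumptions, not the authors' released code or data.

```python
# Sketch only: probe an MNLI-fine-tuned model on one ASP-style NLI pair.
# The checkpoint and example sentences are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "microsoft/deberta-base-mnli"  # assumed checkpoint; swap as needed
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# A degree-semantics style premise/hypothesis pair (illustrative, not from ASP).
premise = "The rope is 10 meters long."
hypothesis = "The rope is longer than 5 meters."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()
# Label names/order vary by checkpoint; read them from the model config.
print(model.config.id2label[pred])
```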
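**Split sketch.** A hypothetical reconstruction of the splitting scheme quoted under "Dataset Splits": half of the adjective vocabulary is held out before data generation (entailment inference), and one of the four physical dimensions is held out (degree estimation). The adjective list below is illustrative, not the paper's actual vocabulary.

```python
# Sketch only: reproduce the two split schemes described in the paper.
import random

# Illustrative adjectives; the real ASP vocabulary is not released here.
adjectives = ["long", "short", "heavy", "light",
              "expensive", "cheap", "hot", "cold"]
random.seed(0)
random.shuffle(adjectives)

# Entailment inference: 50% of adjectives for training, the rest for testing.
half = len(adjectives) // 2
train_adjs, test_adjs = adjectives[:half], adjectives[half:]

# Degree estimation: leave one physical dimension out for testing.
dimensions = ["length", "mass", "price", "temperature"]
held_out = "temperature"  # e.g., train on length/mass/price
train_dims = [d for d in dimensions if d != held_out]

print("train adjectives:", train_adjs, "| test adjectives:", test_adjs)
print("train dimensions:", train_dims, "| test dimension:", held_out)
```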