Adjective Scale Probe: Can Language Models Encode Formal Semantics Information?
Authors: Wei Liu, Ming Xiang, Nai Ding
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here, we propose a diagnostic dataset to investigate how well language models understand the degree semantics of adjectives. In the dataset, referred to as the Adjective Scale Probe (ASP), we semi-automatically generate 8 tests of Natural Language Inference (NLI) questions to test 8 key capabilities of adjective interpretation. We apply the ASP dataset to evaluate the performance of 3 language models, i.e., BERT, DeBERTa, and T0. It is found that language models perform below the majority baseline for most tests of the ASP, even when the models have been fine-tuned to achieve high performance on the large-scale MNLI dataset. But after we fine-tune the pre-trained models on a subset of the ASP, DeBERTa can achieve high performance on the untrained adjectives and untrained tests, suggesting that DeBERTa may have captured degree semantic information of adjectives through pre-training, but it needs specific training data to learn how to apply such information to the current tasks. (An illustrative evaluation sketch follows this table.) |
| Researcher Affiliation | Academia | 1. College of Biomedical Engineering and Instrument Sciences, Zhejiang University; 2. Department of Linguistics, The University of Chicago |
| Pseudocode | No | No section or figure labeled 'Pseudocode' or 'Algorithm' is present in the paper, nor are there structured algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | No | The paper describes the creation of the Adjective Scale Probe (ASP) dataset but does not provide concrete access information (e.g., link, DOI, or formal citation with authors/year) for its public availability. |
| Dataset Splits | Yes | For the entailment inference task, we split the adjective vocabulary into training/testing sets before data generation. Each time, we used 50% of the adjectives to construct the training set and left the remaining half of the adjectives for testing. For the degree estimation task, each time, we used 3 physical dimensions for training, e.g., length, mass, and price, and the remaining dimension, e.g., temperature, was used for testing. (See the split sketch after this table.) |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU/CPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper mentions models like BERT and DeBERTa but does not provide specific version numbers for these models or for any underlying software libraries or dependencies used in the experiments. |
| Experiment Setup | No | The paper states that 'The fine-tuning parameters on the ASP were shown in Appendix Table 2,' indicating that specific experimental setup details like hyperparameters are not included in the main text. |
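
**Evaluation sketch.** A minimal, hypothetical example of how an MNLI-fine-tuned model of the kind evaluated in the paper (e.g., DeBERTa) could be probed on an ASP-style adjective-entailment pair. The checkpoint name, the sentence pair, and the label mapping are illustrative assumptions, not the authors' released code or data.

```python
# Sketch only: probe an MNLI-fine-tuned model on one ASP-style NLI pair.
# The checkpoint and example sentences are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "microsoft/deberta-base-mnli"  # assumed checkpoint; swap as needed
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)

# A degree-semantics style premise/hypothesis pair (illustrative, not from ASP).
premise = "The rope is 10 meters long."
hypothesis = "The rope is longer than 5 meters."

inputs = tokenizer(premise, hypothesis, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()
# Label names/order vary by checkpoint; read them from the model config.
print(model.config.id2label[pred])
```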
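**Split sketch.** A hypothetical reconstruction of the splitting scheme quoted under "Dataset Splits": half of the adjective vocabulary is held out before data generation (entailment inference), and one of the four physical dimensions is held out (degree estimation). The adjective list below is illustrative, not the paper's actual vocabulary.

```python
# Sketch only: reproduce the two split schemes described in the paper.
import random

# Illustrative adjectives; the real ASP vocabulary is not released here.
adjectives = ["long", "short", "heavy", "light",
              "expensive", "cheap", "hot", "cold"]
random.seed(0)
random.shuffle(adjectives)

# Entailment inference: 50% of adjectives for training, the rest for testing.
half = len(adjectives) // 2
train_adjs, test_adjs = adjectives[:half], adjectives[half:]

# Degree estimation: leave one physical dimension out for testing.
dimensions = ["length", "mass", "price", "temperature"]
held_out = "temperature"  # e.g., train on length/mass/price
train_dims = [d for d in dimensions if d != held_out]

print("train adjectives:", train_adjs, "| test adjectives:", test_adjs)
print("train dimensions:", train_dims, "| test dimension:", held_out)
```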