VAST: The Valence-Assessing Semantics Test for Contextualizing Language Models
Authors: Robert Wolfe, Aylin Caliskan
AAAI 2022, pp. 11477–11485 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We introduce VAST, the Valence-Assessing Semantics Test, a novel intrinsic evaluation task for contextualized word embeddings (CWEs). ... GPT-2 results show that the semantics of a word incorporate the semantics of context in layers closer to model output, such that VAST scores diverge between our contextual settings, ranging from Pearson's ρ of .55 to .77 in layer 11. ... We find that a few neurons with values having greater magnitude than the rest mask word-level semantics in GPT-2's top layer, but that word-level semantics can be recovered by nullifying non-semantic principal components: Pearson's ρ in the top layer improves from .32 to .76. |
| Researcher Affiliation | Academia | Robert Wolfe and Aylin Caliskan University of Washington rwolfe3@uw.edu, aylin@uw.edu |
| Pseudocode | No | The VAST algorithm follows: 1. Select a contextual setting (random, bleached, aligned, or misaligned), subword representation (first, last, mean, or max), LM, language, and valence lexicon. 2. Obtain a CWE from every layer of the LM for every word in a valence lexicon in the selected contextual setting, using the selected subword representation. If using the misaligned setting, obtain CWEs for polar words in the aligned setting. See the appendix for details about the misaligned setting. 3. Compute the SC-WEAT effect size for the CWE from each layer of the LM for every word in the lexicon, using CWEs from the same layer for the polar attribute words in the selected contextual setting. If using the misaligned setting, use the polar word CWEs from the aligned setting. 4. Take Pearson's ρ for each layer of SC-WEAT effect sizes vs. valence scores from the lexicon to measure how well LM semantics reflect widely shared human valence norms. 5. Repeat the steps above in different contextual settings, using different subword representations, to derive insights about the semantic encoding and contextualization process. |
| Open Source Code | Yes | Our code is available at https://github.com/wolferobert3/vast_aaai_2022. |
| Open Datasets | Yes | Reddit Corpus We randomly select one context per word from the Reddit corpus of Baumgartner et al. (2020), which better reflects everyday human speech than the expository language found in sources like Wikipedia. Valence Lexica VAST measures valence against the human-rated valence scores in Bellezza's lexicon, Affective Norms for English Words (ANEW), and Warriner's lexicon. Word Similarity Tasks ... WordSim-353 (WS-353) ... SimLex-999 (SL-999) ... Stanford Rare Words (RW) ... MEN Test Collection task... Corpus of Linguistic Acceptability (CoLA) (Warstadt, Singh, and Bowman 2019). |
| Dataset Splits | No | The paper describes using various datasets for evaluation (e.g., CoLA test sentences), but does not provide the specific training/validation/test split percentages or sample counts, nor does it refer to a standard split methodology required for reproduction. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or cloud computing instance types) used for conducting the experiments. |
| Software Dependencies | No | The paper mentions using 'Transformers library' and specific language models like 'GPT-2', 'XLNet', 'BERT', and 'RoBERTa', as well as the 'NLTK tagger', but does not specify version numbers for these software components. |
| Experiment Setup | No | The paper describes the VAST algorithm and contextual settings but does not provide specific hyperparameters (e.g., learning rate, batch size, epochs) or detailed training configurations for any models or experiments, such as the logistic regression for CoLA sentences or POS tagging. |
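The core of steps 3–4 of the VAST algorithm above, computing a Single-Category WEAT effect size for each word's CWE against pleasant/unpleasant polar attribute embeddings and then correlating those effect sizes with human valence ratings, can be sketched as follows. This is a minimal NumPy illustration, not the authors' released implementation; the function names (`sc_weat_effect_size`, `vast_score`) are our own, and inputs are assumed to be precomputed per-layer embedding vectors.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def sc_weat_effect_size(w, pleasant, unpleasant):
    """Single-Category WEAT: association of one CWE `w` with a set of
    pleasant vs. a set of unpleasant polar attribute embeddings."""
    sims_a = np.array([cosine(w, a) for a in pleasant])
    sims_b = np.array([cosine(w, b) for b in unpleasant])
    all_sims = np.concatenate([sims_a, sims_b])
    # Difference of mean associations, normalized by the pooled std. dev.
    return (sims_a.mean() - sims_b.mean()) / all_sims.std(ddof=1)

def vast_score(cwes, valence_scores, pleasant, unpleasant):
    """Pearson's rho between per-word SC-WEAT effect sizes and
    human valence ratings, for one layer of one LM."""
    effects = np.array(
        [sc_weat_effect_size(w, pleasant, unpleasant) for w in cwes]
    )
    return np.corrcoef(effects, np.asarray(valence_scores))[0, 1]
```

In the full procedure this score is computed once per layer and per contextual setting, yielding the layerwise ρ curves reported in the paper.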