Robust Evaluation Measures for Evaluating Social Biases in Masked Language Models

Authors: Yang Liu

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on the publicly available datasets StereoSet (SS) and CrowS-Pairs (CP) show that our proposed measures are significantly more robust and interpretable than those proposed previously.
Researcher Affiliation | Academia | Tianjin University, lauyon@tju.edu.cn
Pseudocode | No | The paper describes its methods using mathematical formulas and textual explanations but does not include any structured pseudocode or algorithm blocks.
Open Source Code | Yes | All experiments were conducted on a GeForce RTX 3070 GPU and the code is available on GitHub.
Open Datasets | Yes | Our experiments use publicly available StereoSet (SS; Nadeem, Bethke, and Reddy 2021) and CrowS-Pairs (CP; Nangia et al. 2020) datasets.
Dataset Splits | Yes | Because the test set part of the SS dataset is not publicly available, we use its development set.
Hardware Specification | Yes | All experiments were conducted on a GeForce RTX 3070 GPU and the code is available on GitHub.
Software Dependencies | No | The paper mentions models like BERT, RoBERTa, and ALBERT, but it does not specify versions for software dependencies such as deep learning frameworks (e.g., PyTorch, TensorFlow) or other relevant libraries.
Experiment Setup | No | The paper mentions the language models used (BERT, RoBERTa, ALBERT) and the datasets, but it does not provide specific experimental setup details such as hyperparameters (e.g., learning rate, batch size, number of epochs) or optimizer settings.
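
The "Open Datasets" and "Dataset Splits" rows report that evaluation uses the publicly available StereoSet development set and CrowS-Pairs. The snippet below is a minimal sketch of pulling both corpora with the Hugging Face `datasets` library; the Hub identifiers `stereoset` and `crows_pairs`, the `intrasentence` configuration, and the split names are assumptions about the hosted copies, not details taken from the paper.

```python
# Sketch: load the two bias benchmarks referenced in the table above.
# Hub names/splits are assumptions; the paper does not state how the data
# were obtained. Newer `datasets` releases may also need
# trust_remote_code=True for script-based datasets such as StereoSet.
from datasets import load_dataset

# StereoSet's test set is not public, so the development (validation) split is used.
stereoset_dev = load_dataset("stereoset", "intrasentence", split="validation")

# CrowS-Pairs ships as a single evaluation split on the Hub.
crows_pairs = load_dataset("crows_pairs", split="test")

print(len(stereoset_dev), "StereoSet intrasentence examples")
print(len(crows_pairs), "CrowS-Pairs sentence pairs")
```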
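The "Pseudocode" row notes that the paper defines its measures only through formulas. The sketch below does not reproduce those measures; it only illustrates the pseudo-log-likelihood style of masked-language-model scoring that sentence-pair bias benchmarks such as SS and CP are commonly evaluated with. The model name and the example sentence pair are placeholders.

```python
# Sketch: pseudo-log-likelihood (PLL) scoring of a sentence pair with an MLM.
# This is NOT the paper's proposed measure, only the generic scoring style
# such measures build on; "bert-base-uncased" and the sentences are placeholders.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log-probabilities of each token when it alone is masked."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    # Skip the special tokens at position 0 ([CLS]) and the last position ([SEP]).
    for i in range(1, ids.size(0) - 1):
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[ids[i]].item()
    return total

# Placeholder stereotype / anti-stereotype pair, for illustration only.
score_a = pseudo_log_likelihood("The nurse said she would help.")
score_b = pseudo_log_likelihood("The nurse said he would help.")
print("model prefers sentence A" if score_a > score_b else "model prefers sentence B")
```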
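The "Software Dependencies" and "Hardware Specification" rows point out that library versions are not reported even though a GeForce RTX 3070 is named. A minimal way to log that information alongside results is sketched below, assuming the usual PyTorch/Transformers/Datasets stack rather than dependencies confirmed by the paper.

```python
# Sketch: record the library versions and GPU an experiment ran on,
# the kind of detail the "Software Dependencies" row finds missing.
# Package choices are assumptions, not versions confirmed by the paper.
import torch
import transformers
import datasets

print("torch", torch.__version__)
print("transformers", transformers.__version__)
print("datasets", datasets.__version__)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))  # e.g. "NVIDIA GeForce RTX 3070"
```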