Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Robust Evaluation Measures for Evaluating Social Biases in Masked Language Models

Authors: Yang Liu

AAAI 2024 | Venue PDF | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Experimental results on the publicly available datasets StereoSet (SS) and CrowS-Pairs (CP) show that our proposed measures are significantly more robust and interpretable than those proposed previously." |
| Researcher Affiliation | Academia | Tianjin University |
| Pseudocode | No | The paper describes its methods using mathematical formulas and textual explanations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | "All experiments were conducted on a GeForce RTX 3070 GPU and the code is available on GitHub." |
| Open Datasets | Yes | "Our experiments use publicly available StereoSet (SS; Nadeem, Bethke, and Reddy 2021) and CrowS-Pairs (CP; Nangia et al. 2020) datasets." |
| Dataset Splits | Yes | "Because the test set part of the SS dataset is not publicly available, we use its development set." |
| Hardware Specification | Yes | "All experiments were conducted on a GeForce RTX 3070 GPU and the code is available on GitHub." |
| Software Dependencies | No | The paper mentions models like BERT, RoBERTa, and ALBERT, but it does not specify versions for software dependencies such as deep learning frameworks (e.g., PyTorch, TensorFlow) or other relevant libraries. |
| Experiment Setup | No | The paper names the language models used (BERT, RoBERTa, ALBERT) and the datasets, but it does not provide specific setup details such as hyperparameters (e.g., learning rate, batch size, number of epochs) or optimizer settings. |
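For context on the evaluation referenced above: the standard CrowS-Pairs metric reports the percentage of sentence pairs for which a masked language model assigns a higher (pseudo-)likelihood to the stereotypical sentence than to its anti-stereotypical counterpart, with 50% indicating no measured preference. The sketch below assumes precomputed per-sentence scores (e.g., pseudo-log-likelihoods from BERT); the scoring function and toy numbers are illustrative stand-ins, not the paper's proposed robust measures.

```python
def bias_score(pairs):
    """CrowS-Pairs-style bias metric.

    pairs: list of (stereo_score, antistereo_score) tuples, where each
    score is a (pseudo-)log-likelihood assigned by a masked LM.
    Returns the percentage of pairs where the stereotypical sentence
    scored higher; 50.0 means no measured preference.
    """
    if not pairs:
        raise ValueError("no pairs given")
    higher = sum(1 for stereo, anti in pairs if stereo > anti)
    return 100.0 * higher / len(pairs)


# Toy scores (hypothetical, not from a real model run):
example = [(-41.2, -43.5), (-38.0, -37.1), (-50.3, -52.9), (-44.4, -44.9)]
print(bias_score(example))  # 75.0
```

The paper's contribution is to replace such raw preference percentages with measures argued to be more robust and interpretable; this sketch only shows the baseline quantity being improved upon.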