Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Robust Evaluation Measures for Evaluating Social Biases in Masked Language Models
Authors: Yang Liu
AAAI 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on the publicly available datasets StereoSet (SS) and CrowS-Pairs (CP) show that our proposed measures are significantly more robust and interpretable than those proposed previously. |
| Researcher Affiliation | Academia | Tianjin University |
| Pseudocode | No | The paper describes its methods using mathematical formulas and textual explanations but does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All experiments were conducted on a GeForce RTX 3070 GPU and the code is available on GitHub. |
| Open Datasets | Yes | Our experiments use publicly available StereoSet (SS; Nadeem, Bethke, and Reddy 2021) and CrowS-Pairs (CP; Nangia et al. 2020) datasets. |
| Dataset Splits | Yes | Because the test set part of the SS dataset is not publicly available, we use its development set. |
| Hardware Specification | Yes | All experiments were conducted on a GeForce RTX 3070 GPU and the code is available on GitHub. |
| Software Dependencies | No | The paper mentions models like BERT, RoBERTa, and ALBERT, but it does not specify versions for software dependencies such as deep learning frameworks (e.g., PyTorch, TensorFlow) or other relevant libraries. |
| Experiment Setup | No | The paper mentions the language models used (BERT, RoBERTa, ALBERT) and the datasets, but it does not provide specific experimental setup details such as hyperparameters (e.g., learning rate, batch size, number of epochs) or optimizer settings. |