Debiasing Attention Mechanism in Transformer without Demographics
Authors: Shenyu Lu, Yipei Wang, Xiaoqian Wang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments in computer vision and natural language processing tasks and show that our method is comparable and even outperforms the state-of-the-art method with substantially lower energy consumption. We conduct extensive experiments on real-world datasets, encompassing various classification tasks in computer vision and natural language processing (NLP) fields. |
| Researcher Affiliation | Academia | Shenyu Lu, Yipei Wang & Xiaoqian Wang Elmore Family School of Electrical and Computer Engineering Purdue University West Lafayette, IN 47906, USA {lu876,wang4865,joywang}@purdue.edu |
| Pseudocode | Yes | We summarized our method in an algorithm, detailed in Appendix G. Algorithm 1 Debias Attention mechanism |
| Open Source Code | Yes | To reproduce our experiment, we have made the code available at https://github.com/lu876/Debiasing-Attention-Mechanism-in-Transformer-without-Demographics. |
| Open Datasets | Yes | We test all methods on two real-world datasets: CelebA (Liu et al., 2015), and UTK (Zhang & Qi, 2017). ... utilizing both the HateXplain (Mathew et al., 2021) and MultiNLI (Williams et al., 2017) datasets. |
| Dataset Splits | No | The paper mentions using a 'validation set' for hyperparameter tuning and model selection, such as 'We save the model that achieves the highest validation accuracy.' (Appendix E), but does not explicitly provide the split percentages or sample counts for the training, validation, and test sets. |
| Hardware Specification | Yes | We train all methods on a single NVIDIA RTX-3090 GPU with 24576 MiB memory. |
| Software Dependencies | No | The paper mentions using the 'Huggingface library' and the 'AdamW' optimizer but does not specify version numbers for these or for any other software dependencies. |
| Experiment Setup | Yes | For CelebA and UTK, we take the AdamW as the optimizer with a learning rate of 10^-4, and no scheduler is applied for the fair comparison. For NLP tasks, we take the AdamW as the optimizer with a learning rate of 10^-5. We share all methods with the same batch size and optimizer configuration. We tune the hyper-parameter η on the validation set to achieve the highest accuracy. For CelebA and UTK experiments, we set η = 0.15 and η = 0.10 respectively. For HateXplain and MultiNLI, we set η = 0.25. |
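
To make the reported experiment setup concrete, the following is a minimal sketch (not the authors' released code) of how the stated optimizer and hyperparameter choices could be wired up in PyTorch. The `CONFIGS` dictionary, the dataset keys, and the `build_optimizer` helper are hypothetical names introduced here for illustration; the role of η inside the debiased attention mechanism is specific to the paper and is only recorded as a configuration value.

```python
# Minimal sketch of the reported experiment setup (AdamW, no LR scheduler,
# per-dataset learning rate and eta). This is NOT the authors' code; the
# CONFIGS dict and build_optimizer helper are hypothetical illustrations.
import torch
from torch.optim import AdamW

CONFIGS = {
    "celeba":     {"lr": 1e-4, "eta": 0.15},  # computer vision task
    "utk":        {"lr": 1e-4, "eta": 0.10},  # computer vision task
    "hatexplain": {"lr": 1e-5, "eta": 0.25},  # NLP task
    "multinli":   {"lr": 1e-5, "eta": 0.25},  # NLP task
}

def build_optimizer(model: torch.nn.Module, dataset: str) -> AdamW:
    """Return an AdamW optimizer with the reported learning rate; no scheduler is used."""
    cfg = CONFIGS[dataset]
    return AdamW(model.parameters(), lr=cfg["lr"])
```

In such a setup, `eta` would be passed to the debiasing component during training, with its value selected on the validation set as the quoted passage describes.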