Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability
Authors: Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human-interpretable explanations than post-hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we present a first exploration of transforming decoder-only models to B-cos LMs for generation tasks. |
| Researcher Affiliation | Academia | Yifan Wang (EMAIL), Saarland University, Saarbrücken, Germany; Sukrut Rao (EMAIL), Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany; Ji-Ung Lee (EMAIL), Saarland University, Saarbrücken, Germany; Mayank Jobanputra (EMAIL), Saarland University, Saarbrücken, Germany; Vera Demberg (EMAIL), Saarland University, Saarbrücken, Germany, and Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany |
| Pseudocode | No | The paper describes methodologies in text and mathematical formulations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/Ewanwong/bcos_lm. |
| Open Datasets | Yes | Our experiments use three datasets: AG News (topic classification, Zhang et al., 2015), IMDB (sentiment analysis, Maas et al., 2011), and HateXplain (hate speech detection, Mathew et al., 2021). ... we use the BLiMP dataset (Warstadt et al., 2020) to assess explanations for linguistic phenomena, and the Indirect Object Identification (IOI) dataset (Brian Muhia, 2022) to test models' reasoning about object identification. |
| Dataset Splits | Yes | For validation, we randomly sample half of the test set from IMDB and AG News. ... For faithfulness evaluation, we conduct perturbation-based evaluations on 2,000 test examples and SeqPG on 500 test examples for AG News and IMDB. For HateXplain, we use the full test set for perturbation-based evaluation (1,924 examples) and construct 269, 310, and 308 SeqPG examples from it using BERT, DistilBERT, and RoBERTa, respectively. |
| Hardware Specification | Yes | Unless stated otherwise, all experiments are conducted on a single NVIDIA H100 GPU. |
| Software Dependencies | No | For all PLMs used in the experiments, we use the uncased base version from Hugging Face (Wolf et al., 2020). For IxG and ShapSampl, we use the Captum (Kokhlikyan et al., 2020) implementations. We implement the Attention method ourselves, and LIME is sourced from the lit library. For DecompX and SIG, we use their official implementations with default configurations. ... For Saloss models, we use the official codebase with default hyperparameters to train BERT and RoBERTa on AG News, IMDB, and HateXplain. |
| Experiment Setup | Yes | For both conventional models and B-cos LMs, we train them for 5 epochs with 10% linear warm-up steps on the downstream task datasets. The learning rates are set to 2e-5 for IMDB and HateXplain, and 3e-5 for AG News. All models use a batch size of 16 and a maximum sequence length of 512. For validation, we randomly sample half of the test set from IMDB and AG News. We set B=1.25 for IMDB and B=1.5 for AG News and HateXplain datasets. |
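The reported fine-tuning settings can be collected into a small configuration sketch. Only the hyperparameters in the row above come from the paper; the `warmup_steps` helper and its use of AG News's 120,000 training examples are illustrative assumptions about how "10% linear warm-up steps" would be computed:

```python
import math

# Hyperparameters reported in the Experiment Setup row.
EPOCHS = 5
BATCH_SIZE = 16
MAX_SEQ_LEN = 512
WARMUP_FRACTION = 0.10  # "10% linear warm-up steps"

# Per-dataset settings as reported; B is the B-cos exponent.
DATASETS = {
    "IMDB":       {"lr": 2e-5, "B": 1.25},
    "HateXplain": {"lr": 2e-5, "B": 1.5},
    "AG News":    {"lr": 3e-5, "B": 1.5},
}

def warmup_steps(num_train_examples: int) -> int:
    """Assumed reading: warm-up steps = 10% of total optimizer steps."""
    steps_per_epoch = math.ceil(num_train_examples / BATCH_SIZE)
    return int(WARMUP_FRACTION * steps_per_epoch * EPOCHS)

# AG News has 120,000 training examples:
# 120,000 / 16 = 7,500 steps/epoch; * 5 epochs = 37,500; * 0.10 = 3,750.
print(warmup_steps(120_000))  # -> 3750
```

Whether warm-up is taken over total steps or per epoch is not stated in the excerpt; the sketch assumes the common total-steps convention (e.g., `warmup_ratio` in Hugging Face's `TrainingArguments`).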