BERTology Meets Biology: Interpreting Attention in Protein Language Models

Authors: Jesse Vig, Ali Madani, Lav R. Varshney, Caiming Xiong, Richard Socher, Nazneen Rajani

ICLR 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate a set of methods for analyzing protein Transformer models through the lens of attention. We show that attention: (1) captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure, (2) targets binding sites, a key functional component of proteins, and (3) focuses on progressively more complex biophysical properties with increasing layer depth. We find this behavior to be consistent across three Transformer architectures (BERT, ALBERT, XLNet) and two distinct protein datasets.
Researcher Affiliation | Collaboration | Jesse Vig¹, Ali Madani¹, Lav R. Varshney¹,², Caiming Xiong¹, Richard Socher¹, Nazneen Fatema Rajani¹ (¹Salesforce Research, ²University of Illinois at Urbana-Champaign); {jvig,amadani,cxiong,rsocher,nazneen.rajani}@salesforce.com, varshney@illinois.edu
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code for visualization and analysis is available at https://github.com/salesforce/provis.
Open Datasets | Yes | For our analyses of amino acids and contact maps, we use a curated dataset from TAPE based on ProteinNet (AlQuraishi, 2019; Fox et al., 2013; Berman et al., 2000; Moult et al., 2018)... For the analysis of secondary structure and binding sites we use the Secondary Structure dataset (Rao et al., 2019; Berman et al., 2000; Moult et al., 2018; Klausen et al., 2019) from TAPE. We obtained token-level binding site and protein modification labels from the Protein Data Bank (Berman et al., 2000).
Dataset Splits | Yes | For the diagnostic classifier, we used the respective training splits for training and the validation splits for evaluation. See Appendix B.4 for additional details. Table 2 (Datasets used in analysis): ProteinNet: 25,299 train / 224 validation; Secondary Structure: 8,678 train / 2,170 validation; Binding Sites / PTM: 5,734 train / 1,418 validation.
Hardware Specification | Yes | Experiments were performed on a single 16 GB Tesla V100 GPU.
Software Dependencies | No | The paper mentions several models and tools such as BERT, ALBERT, XLNet, TAPE, ProtTrans, and NGL Viewer, but it does not specify any version numbers for these software dependencies or underlying libraries (e.g., PyTorch, TensorFlow).
Experiment Setup | Yes | We set the attention threshold θ to 0.3 to select for high-confidence attention while retaining sufficient data for analysis. We truncate all protein sequences to a length of 512 to reduce memory requirements.
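As a rough illustration of the analysis described in the "Research Type" and "Experiment Setup" rows (the share of high-confidence attention, with weight above θ = 0.3, that aligns with a structural property such as residue-residue contact), here is a minimal numpy sketch. It is a hypothetical stand-in rather than the authors' implementation, which lives in the provis repository; the function and variable names below are made up.

```python
import numpy as np

def attention_property_agreement(attn, prop, theta=0.3):
    """Share of high-confidence attention (weight > theta) that lands on
    token pairs (i, j) for which a structural property holds, e.g. residues
    i and j being in contact in the folded structure.

    attn  : (L, L) attention matrix for one head and one protein
    prop  : (L, L) boolean matrix, True where the property holds
    theta : attention threshold (0.3 in the setup quoted above)
    """
    mask = attn > theta                  # keep only high-confidence attention
    if not mask.any():
        return float("nan")              # this head never attends above theta
    weights = attn[mask]
    hits = prop[mask].astype(float)
    return float((weights * hits).sum() / weights.sum())

# Toy usage with random stand-ins for a 128-residue protein; real inputs would
# be per-head attention from the model and contact maps derived from ProteinNet.
L = 128
attn = np.random.dirichlet(np.ones(L), size=L)   # rows sum to 1, like softmax attention
contacts = np.random.rand(L, L) < 0.05           # sparse placeholder "contact map"
print(attention_property_agreement(attn, contacts))
```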
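Attention matrices like the attn input above can be pulled from one of the public protein language models the paper discusses. The sketch below uses the Hugging Face transformers API with the ProtTrans ProtBert checkpoint (Rostlab/prot_bert); this may not be the exact checkpoint or extraction path the authors used, whose pipeline is in provis. The 512-token truncation mirrors the setup quoted in the last row of the table.

```python
import torch
from transformers import BertModel, BertTokenizer

# ProtTrans BERT checkpoints expect uppercase, space-separated amino acids
# and no lower-casing by the tokenizer.
tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert", output_attentions=True)
model.eval()

sequence = "M K T A Y I A K Q R"                 # toy 10-residue sequence
inputs = tokenizer(sequence, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)

# outputs.attentions is a tuple with one (batch, heads, seq, seq) tensor per
# layer; positions include the special tokens added by the tokenizer.
attn = torch.stack(outputs.attentions).squeeze(1)  # (layers, heads, seq, seq)
print(attn.shape)
```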