On Pre-training Language Model for Antibody
Authors: Danqing Wang, Fei Ye, Hao Zhou
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To investigate the problem, we aim to answer several key questions in this paper, such as how pre-trained language models perform in antibody tasks with different specificity and how introducing specific biological mechanisms to the pre-training process can benefit the model. Additionally, we evaluate whether the learned antibody pre-trained representations can be applied to real-world antibody problems, such as drug discovery and immune process understanding. Previously, the lack of an available benchmark largely hindered the study of these questions. To aid in our investigation, we provide an AnTibody Understanding Evaluation (ATUE) benchmark. We comprehensively evaluate the performance of protein pre-trained language models through an empirical study, along with conclusions and new insights. |
| Researcher Affiliation | Collaboration | Danqing Wang (ByteDance Research, Shanghai, China; University of California, Santa Barbara), Fei Ye (ByteDance Research, Shanghai, China), Hao Zhou (Institute for AI Industry Research, Tsinghua University) |
| Pseudocode | No | The paper describes methods and processes in text and figures but does not contain any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | Yes | Our ATUE and code are released at https://github.com/dqwang122/EATLM. |
| Open Datasets | Yes | To aid in our investigation, we provide an AnTibody Understanding Evaluation (ATUE) benchmark. ... All data are publicly open and used under the right license. ... We collect the antigen binding data from Mason et al. (2021)... The paratope data is collected from Liberis et al. (2018)... We collect 88,094 sequences from Mroczek et al. (2014). ... We collected antibody sequences from 133 SARS-CoV-2 patients and 87 healthy persons from OAS and followed the processing pipeline of Kim et al. (2021). |
| Dataset Splits | Yes | We collect the antigen binding data from Mason et al. (2021) and follow the training/validation/test split of 15,128/3,242/3,242. ... For other tasks that do not provide a standard split, we use 10-fold cross-validation. ... We conduct 10-fold cross-validation on paratope prediction, B cell maturation analysis, and antibody discovery. For antigen binding prediction, we conduct three repetitive experiments with different random seeds. (A split sketch illustrating this protocol follows the table.) |
| Hardware Specification | No | The paper does not mention any specific hardware, such as GPU models (e.g., NVIDIA A100, RTX 2080 Ti), CPU models, or cloud computing instance types, used to run the experiments. The closest statement is the footnote 'Work was done when Danqing Wang was in Bytedance Research', which gives no hardware information. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer (Kingma & Ba, 2015)' and 'base Transformer architecture (Vaswani et al., 2017)' but does not specify any software versions for programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or other libraries/dependencies (e.g., CUDA). |
| Experiment Setup | Yes | We use the base Transformer architecture (Vaswani et al., 2017) with 12 layers, 12 heads, and 768 hidden states. The total parameters are 86M. We use the Adam optimizer (Kingma & Ba, 2015) with a maximum learning rate of 2e-4 and a warm-up step of 24,000. The maximum length is set to 400 since most antibody sequences are shorter than 180. We first pre-train our model with the MLM objective. During pre-training, 15% of tokens are randomly selected, with 80% masked, 10% replaced, and 10% kept. Then we conduct further pre-training on two antibody-related tasks with a smaller learning rate of 1e-5. ... For finetuning, we limit the max epochs to 30 and use the Adam optimizer with a max learning rate of 3e-5. (A configuration and masking sketch follows the table.) |
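
To make the split protocol in the Dataset Splits row concrete, the sketch below reproduces the fixed 15,128/3,242/3,242 antigen-binding split and the 10-fold cross-validation used for the tasks without a standard split. It assumes NumPy and scikit-learn; the array names, placeholder data, and random seed are illustrative and not taken from the paper.

```python
# Minimal sketch of the data-split protocol described in the Dataset Splits row.
# Only the 15,128/3,242/3,242 split and the 10-fold setup come from the paper;
# everything else (placeholder sequences, seed, library choice) is an assumption.
import numpy as np
from sklearn.model_selection import KFold

# Antigen binding (Mason et al., 2021): fixed train/validation/test split.
binding_seqs = np.array([f"binding_{i}" for i in range(15128 + 3242 + 3242)])
train, valid, test = np.split(binding_seqs, [15128, 15128 + 3242])
assert (len(train), len(valid), len(test)) == (15128, 3242, 3242)

# Paratope prediction, B cell maturation, antibody discovery: no standard split,
# so use 10-fold cross-validation and average metrics across folds.
task_seqs = np.array([f"task_{i}" for i in range(10_000)])  # placeholder task data
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, eval_idx) in enumerate(kfold.split(task_seqs)):
    train_fold, eval_fold = task_seqs[train_idx], task_seqs[eval_idx]
    # ... fine-tune the pre-trained model on train_fold and score on eval_fold
```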
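
The Experiment Setup row lists the architecture and training hyperparameters but no code. The sketch below collects the reported values in a config dictionary and implements the described MLM corruption: 15% of tokens selected, of which 80% are masked, 10% replaced by a random token, and 10% kept. PyTorch, the token ids, and the vocabulary size are assumptions; only the percentages and hyperparameters come from the paper.

```python
# Minimal sketch of the hyperparameters and MLM masking scheme quoted in the
# Experiment Setup row. MASK_ID and VOCAB_SIZE are hypothetical placeholders.
import torch

CONFIG = dict(
    num_layers=12, num_heads=12, hidden_size=768,   # base Transformer, 86M parameters
    max_len=400,                                    # most antibody sequences are shorter than 180
    pretrain_lr=2e-4, warmup_steps=24_000,          # Adam, MLM pre-training
    further_pretrain_lr=1e-5,                       # further pre-training on antibody-related tasks
    finetune_lr=3e-5, finetune_max_epochs=30,
)

MASK_ID, VOCAB_SIZE = 4, 25  # hypothetical ids for an amino-acid vocabulary

def mask_tokens(input_ids: torch.Tensor, mlm_prob: float = 0.15):
    """Select 15% of tokens; 80% -> [MASK], 10% -> random token, 10% unchanged."""
    labels = input_ids.clone()
    selected = torch.rand_like(input_ids, dtype=torch.float) < mlm_prob
    labels[~selected] = -100  # unselected positions are ignored by the loss

    # 80% of the selected tokens become [MASK].
    masked = selected & (torch.rand_like(labels, dtype=torch.float) < 0.8)
    input_ids = input_ids.masked_fill(masked, MASK_ID)

    # Half of the remaining selected tokens (10% overall) become random tokens;
    # the rest (10% overall) are kept unchanged.
    random_repl = selected & ~masked & (torch.rand_like(labels, dtype=torch.float) < 0.5)
    input_ids = torch.where(random_repl,
                            torch.randint_like(input_ids, VOCAB_SIZE),
                            input_ids)
    return input_ids, labels
```

Calling `mask_tokens` on a batch of token ids returns the corrupted inputs and a label tensor with -100 at unselected positions, which PyTorch's cross-entropy loss skips by default via its ignore_index.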