Training Compute-Optimal Protein Language Models

Authors: Xingyi Cheng, Bo Chen, Pan Li, Jing Gong, Jie Tang, Le Song

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We trained over 300 models ranging from 3.5 million to 10.7 billion parameters on 5 to 200 billion unique tokens, to investigate the relations between model sizes, training token numbers, and objectives. (A generic scaling-fit sketch follows the table.)
Researcher Affiliation | Collaboration | Xingyi Cheng¹, Bo Chen², Pan Li¹, Jing Gong¹, Jie Tang², Le Song¹,³; ¹BioMap Research, ²Tsinghua University, ³MBZUAI
Pseudocode | No | The paper describes its methodology using prose and mathematical equations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code: https://github.com/cxysteven/ScalingProteinLM
Open Datasets | Yes | Our investigation is grounded in a massive dataset consisting of 939 million protein sequences. We revisited the protein sequence data used for training PLMs and collected a dataset of 194 billion unique tokens on 939M unique sequences from publicly available sources to address the issues of overfitting and performance plateaus in protein language modeling.
Dataset Splits | No | The paper frequently mentions using 'validation loss' and an 'IID validation subset' (e.g., 'We focus on the Independent and Identically Distributed (IID) validation and Out-Of-Distribution (OOD) test PPL'). However, it does not specify the exact percentages, sample counts, or methodology used to create these validation splits, nor does it cite predefined splits for these partitions. (An illustrative holdout/PPL sketch follows the table.)
Hardware Specification | Yes | We conducted all experiments using Ampere A100 GPUs (80 GB) equipped with NVLink.
Software Dependencies | No | The paper mentions several software frameworks and components used, such as the 'GLM framework', 'DeepSpeed', 'Megatron', the 'Transformer architecture', 'DeepNorm', 'FlashAttention', and the 'AdamW optimizer'. However, it does not provide specific version numbers for any of these software dependencies. (A generic DeepNorm sketch follows the table.)
Experiment Setup | Yes | The max LR, empirically found to range between 6×10⁻⁴ and 1.2×10⁻⁴ from small to large model sizes, was used along with a cosine decay strategy that reduced it to 0.1× the max LR. Both CLM and MLM were trained under similar settings for each model size, with a consistent LR and a minimum warm-up period of 2.5% of steps, extending to at least 100K training steps. All sequences were set to a length of 1024, with sequences concatenated using an <EOS> delimiter. ... The AdamW optimizer [50] was used with β1 = 0.9, β2 = 0.95, ε = 1×10⁻⁸, and a weight decay of 0.01. All experiments omitted dropout (it reduced capacity and hindered model scaling) and trained with bfloat16. (A sketch of these optimizer and schedule settings follows the table.)
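
The Research Type row describes a scaling study over model size and token count. As a point of reference, a compute-optimal analysis typically fits a parametric loss surface such as L(N, D) = E + A/N^α + B/D^β to (parameters, tokens, loss) observations. The sketch below is a generic Chinchilla-style least-squares fit on synthetic data; the functional form, fitting method, and all numbers are illustrative assumptions, not the paper's procedure.

```python
# Generic Chinchilla-style fit, NOT the paper's procedure: fit
# L(N, D) = E + A / N**alpha + B / D**beta to (model size, token count, loss)
# observations by least squares. All numbers below are synthetic.
import numpy as np
from scipy.optimize import curve_fit

def parametric_loss(ND, E, A, alpha, B, beta):
    N, D = ND
    return E + A / N**alpha + B / D**beta

# Synthetic observations on a small grid of model sizes and token counts.
rng = np.random.default_rng(0)
N = np.array([3.5e6, 3.0e7, 1.0e8, 1.0e9, 3.0e9, 1.07e10])
D = np.array([5e9, 1e10, 2e10, 5e10, 1e11, 2e11])
Ng, Dg = (a.ravel() for a in np.meshgrid(N, D))
loss = parametric_loss((Ng, Dg), 1.7, 90.0, 0.30, 350.0, 0.28)
loss = loss + rng.normal(scale=0.01, size=loss.shape)

popt, _ = curve_fit(parametric_loss, (Ng, Dg), loss,
                    p0=[1.5, 50.0, 0.3, 100.0, 0.3], maxfev=50000)
E, A, alpha, B, beta = popt
print(f"E={E:.3f}  A={A:.3g}  alpha={alpha:.3f}  B={B:.3g}  beta={beta:.3f}")

# With fitted exponents, the standard compute-optimal split of a budget C ~ 6*N*D
# is N_opt ∝ C**(beta/(alpha+beta)) and D_opt ∝ C**(alpha/(alpha+beta)).
```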
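
For the Dataset Splits row, the quoted text refers to IID validation and OOD test perplexity without giving the split recipe. The sketch below only illustrates what such an evaluation looks like: a random holdout plus PPL = exp(mean token-level cross-entropy) for a causal LM. The 1% holdout fraction, batching, and model interface are hypothetical placeholders, and MLM perplexity would instead be computed over masked positions.

```python
# Illustrative only: the paper does not specify its split construction, so the
# holdout fraction, tokenizer/batching, and model interface are placeholders.
import math, random
import torch
import torch.nn.functional as F

def iid_split(sequences, holdout_frac=0.01, seed=0):
    """Randomly hold out a fraction of sequences as an IID validation set."""
    seqs = list(sequences)
    random.Random(seed).shuffle(seqs)
    n_val = max(1, int(len(seqs) * holdout_frac))
    return seqs[n_val:], seqs[:n_val]          # train, validation

@torch.no_grad()
def perplexity(model, batches, device="cuda"):
    """PPL = exp(mean token-level cross-entropy) over a validation/OOD set (CLM)."""
    total_nll, total_tokens = 0.0, 0
    for input_ids in batches:                   # each: LongTensor of shape [B, T]
        input_ids = input_ids.to(device)
        logits = model(input_ids)               # assumed to return [B, T, vocab]
        nll = F.cross_entropy(
            logits[:, :-1].reshape(-1, logits.size(-1)),
            input_ids[:, 1:].reshape(-1),       # next-token targets
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += input_ids[:, 1:].numel()
    return math.exp(total_nll / total_tokens)
```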
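
Among the components listed under Software Dependencies, DeepNorm is the least self-explanatory: it keeps post-LayerNorm but up-weights the residual branch, LN(α·x + sublayer(x)), with a depth-dependent α and a matching β applied to parts of the initialization. The sketch below is a generic reimplementation following the DeepNet paper's decoder-only constants α = (2M)^(1/4) and β = (8M)^(-1/4); it is not the authors' GLM/Megatron code, and the blanket weight rescaling in _scale_init is a rough stand-in for DeepNet's more selective initialization.

```python
# Generic sketch of a DeepNorm residual block (from the DeepNet paper), one of
# the components listed above. NOT the authors' implementation.
import torch
import torch.nn as nn

class DeepNormResidual(nn.Module):
    def __init__(self, sublayer: nn.Module, d_model: int, num_layers: int):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.alpha = (2 * num_layers) ** 0.25    # up-weights the residual branch
        self.beta = (8 * num_layers) ** -0.25    # down-scales selected init gains
        self._scale_init()

    def _scale_init(self):
        # DeepNet rescales only specific projections by beta at init; shrinking
        # every Linear weight here is a rough stand-in for that recipe.
        with torch.no_grad():
            for m in self.sublayer.modules():
                if isinstance(m, nn.Linear):
                    m.weight.mul_(self.beta)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Post-LN with an up-weighted residual: LN(alpha * x + sublayer(x)).
        return self.norm(self.alpha * x + self.sublayer(x))
```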
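
The Experiment Setup row pins down the optimizer and schedule almost completely. Below is a minimal PyTorch sketch wiring the quoted values together (AdamW betas/eps/weight decay, linear warm-up for 2.5% of steps, cosine decay to 0.1× the peak LR, sequence length 1024, dropout disabled, bfloat16); the placeholder model, the 100K-step horizon taken from the quote's lower bound, and the LambdaLR wiring are assumptions rather than the authors' training code.

```python
# Sketch of the quoted optimizer/schedule settings; everything not in the quote
# (the placeholder model, LambdaLR wiring) is an illustrative assumption.
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

max_lr = 6e-4                              # small models; ~1.2e-4 for the largest
total_steps = 100_000                      # "at least 100K training steps"
warmup_steps = int(0.025 * total_steps)    # minimum 2.5% warm-up
min_lr_ratio = 0.1                         # cosine decay down to 0.1 * max LR

model = torch.nn.TransformerEncoderLayer(  # placeholder module, dropout disabled
    d_model=1024, nhead=16, dim_feedforward=4096, dropout=0.0, batch_first=True
).to(dtype=torch.bfloat16)

optimizer = AdamW(model.parameters(), lr=max_lr,
                  betas=(0.9, 0.95), eps=1e-8, weight_decay=0.01)

def lr_lambda(step: int) -> float:
    """Linear warm-up, then cosine decay from 1.0 down to min_lr_ratio."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return min_lr_ratio + (1.0 - min_lr_ratio) * cosine

scheduler = LambdaLR(optimizer, lr_lambda)
# Each training step: loss.backward(); optimizer.step(); scheduler.step()
```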