Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Metalic: Meta-Learning In-Context with Protein Language Models
Authors: Jacob Beck, Shikha Surana, Manus McAuliffe, Oliver Bent, Thomas Barrett, Juan Jose Garau-Luis, Paul Duckworth
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate Metalic on fitness prediction tasks from the ProteinGym benchmark (Notin et al., 2024). We evaluate in the zero-shot setting with no support data, and the few-shot setting with limited support data. To establish SOTA results in the zero-shot setting, we first compare to the predictions provided by ProteinGym for the baseline models. To establish strong performance in the few-shot setting, since predictions are not provided, we train baselines from Hawkins-Hooker et al. (2024). While Metalic does not achieve SOTA results in evaluations on proteins that have multiple mutations (multi-mutant proteins), we demonstrate that the performance grows as we add more meta-training tasks, providing a path forward for Metalic in the multi-mutant setting in the future. We perform ablations of Metalic, to show the benefits of meta-learning, in-context learning, and fine-tuning. Finally, we compare to the gradient-based method, Reptile (Nichol et al., 2018), to show that taking account of gradients during training is an unnecessary computational burden. |
| Researcher Affiliation | Collaboration | InstaDeep, Boston, MA, USA & London, UK | https://github.com/instadeepai/metalic |
| Pseudocode | No | The paper describes the methodology using textual explanations, mathematical equations, and architectural diagrams (e.g., Figure 2, Figure 6). It does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | InstaDeep | https://github.com/instadeepai/metalic |
| Open Datasets | Yes | In our experiments, we focus on ProteinGym deep mutational scans. Each task in ProteinGym measures one property on a set of proteins that all differ by one amino acid, or multiple amino acids, from a reference wild-type protein. We have 121 single-mutant tasks and 68 multi-mutant tasks from ProteinGym. From these, we evaluate over eight held-out single-mutant tasks, and five held-out multi-mutant tasks, following Notin et al. (2023); Hawkins-Hooker et al. (2024). |
| Dataset Splits | Yes | We have 121 single-mutant tasks and 68 multi-mutant tasks from ProteinGym. From these, we evaluate over eight held-out single-mutant tasks, and five held-out multi-mutant tasks, following Notin et al. (2023); Hawkins-Hooker et al. (2024). This leaves 113 single-mutant and 68 multi-mutant tasks for training when evaluating single mutants (Sections 4.2 and 4.3), and 121 single-mutant and 63 multi-mutant tasks for training when evaluating multiple mutants (Section 4.4). All fitness values are standardized by subtracting the mean and dividing by the standard deviation by task. We use a query set size of N^(Q) = 100, and the size of the support set is determined by our evaluation setting and is one of three sizes: N^(S) = 0, 16, or 128. |
| Hardware Specification | Yes | Using a single NVIDIA A100 (80 GB), training our model takes roughly 2 to 8 days per seed, depending on support size and frequency of fine-tuned evaluation. |
| Software Dependencies | No | The paper mentions specific models like ESM2-8M and ESM-IF1, and optimizers like Adam, but does not provide version numbers for general software dependencies such as Python, PyTorch, TensorFlow, or CUDA, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Table 7: Hyper-Parameters for Metalic (values in parentheses denote an earlier set used for ablations and Reptile comparisons). Training Steps (total meta-training steps): 50,000; Warm-Up Steps (linear warm-up preceding cosine decay): 5,000; Batch Size (contexts per training step; gradient accumulation is used per context, so this scales training time linearly): 1; Weight Decay (non-bias parameters only): 5e-3; Learning Rate (meta-training and fine-tuning): 6e-5; Min LR Fraction (minimum fraction of the LR maintained during cosine decay): 0.1; Adam Eps: 1e-8; Adam Beta1: 0.9; Adam Beta2: 0.999; Gradient Clip Value (maximum gradient norm): 1.0; ESM Embed Model (full name of the ESM2 model used): esm2_t6_8M_UR50D; ESM Embed Layer (ESM2 layer used as embedding): 3; Number Fine-tune Steps (gradient updates after meta-training): 100; Num ProteinNPT Layers (layers using axial attention, as in ProteinNPT): 6; Condition on Pooled Sequence (whether each sequence is pooled or ignored after axial attention): True; MLP Layer Sizes (fully connected layers after axial attention): [768, 768, 768, 768]; Embed Dim (embedding dimension for all inputs, including protein sequences and fitness values): 768; Axial Forward Embed Dim (feed-forward hidden size within the ProteinNPT layer): 3072; Attention Heads: 4; Dropout Prob (training and fine-tuning, non-axial layers): 0.1; Attention Dropout (axial attention layers): 0.1; Num Single Tasks (included in meta-training even when testing on multi-mutants; eight held out for single-mutant evaluation): 121; Num Multiple Tasks (included in meta-training even when testing on single-mutants; five held out for multi-mutant evaluation): 68; Warm-up During Fine-tuning (whether the scheduler's linear warm-up is used during fine-tuning): False. |
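The learning-rate schedule in Table 7 combines three reported values: a base LR of 6e-5, 5,000 linear warm-up steps out of 50,000 total, and a cosine decay floored at 0.1 of the base LR. A minimal sketch of that schedule, assuming a standard cosine shape between warm-up and the floor (the function name and exact decay formula are our assumptions, not taken from the paper or its code):

```python
import math

def metalic_lr(step, base_lr=6e-5, warmup=5_000, total=50_000, min_frac=0.1):
    """Sketch of the Table 7 schedule: linear warm-up for `warmup` steps,
    then cosine decay from base_lr down to min_frac * base_lr at `total`."""
    if step < warmup:
        # Linear warm-up from 0 to base_lr.
        return base_lr * step / warmup
    # Cosine decay over the remaining steps, floored at min_frac * base_lr.
    progress = (step - warmup) / (total - warmup)
    floor = min_frac * base_lr
    return floor + 0.5 * (base_lr - floor) * (1 + math.cos(math.pi * progress))
```

For example, `metalic_lr(5_000)` returns the full 6e-5 and `metalic_lr(50_000)` returns the 6e-6 floor; note Table 7 also reports that this warm-up is disabled during fine-tuning.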
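The dataset-splits evidence states that all fitness values are standardized per task by subtracting the mean and dividing by the standard deviation. A minimal sketch of that per-task z-scoring (helper name is ours; the paper does not specify whether a population or sample standard deviation is used, so population is assumed here):

```python
from statistics import mean, pstdev

def standardize_fitness(values):
    """Per-task standardization as described in the paper:
    subtract the task mean and divide by the task standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population std; ddof choice is an assumption
    return [(v - mu) / sigma for v in values]
```

Applied independently to each of the 121 single-mutant and 68 multi-mutant tasks, this puts every task's fitness labels on a comparable scale before meta-training.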