Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Protein Structure Tokenization: Benchmarking and New Recipe

Authors: Xinyu Yuan, Zichen Wang, Marcus D. Collins, Huzefa Rangwala

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We first introduce StructTokenBench, a framework that comprehensively evaluates the quality and efficiency of structure tokenizers, focusing on fine-grained local substructures rather than global structures, as typical in existing benchmarks. Our evaluations reveal that no single model dominates all benchmarking perspectives. ... Compared to the leading VQ-VAE model ESM3, our method achieves an average of 6.31% performance improvement across 24 supervised tasks, with sensitivity and utilization rates increased by 12.83% and 124.03%, respectively.
Researcher Affiliation Collaboration (1) Mila Quebec AI Institute, (2) University of Montreal, (3) Amazon. Emails: Xinyu Yuan <EMAIL>, Zichen Wang <EMAIL>, Marcus Collins <EMAIL>, Huzefa Rangwala <EMAIL>
Pseudocode Yes Algorithm 1 Gram-Schmidt Process
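The referenced Algorithm 1 (the Gram-Schmidt process) can be sketched as below. This is an illustrative NumPy implementation of the classical technique, not the authors' code; the function name `gram_schmidt` is a hypothetical choice.

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalize a sequence of vectors via (modified) Gram-Schmidt.

    Returns an array whose rows form an orthonormal basis of the span
    of the input vectors; near-linearly-dependent inputs are dropped.
    """
    basis = []
    for v in vectors:
        w = np.array(v, dtype=float)       # work on a float copy
        for q in basis:
            w -= np.dot(w, q) * q          # remove component along q
        norm = np.linalg.norm(w)
        if norm > 1e-12:                   # skip (near-)dependent vectors
            basis.append(w / norm)
    return np.array(basis)
```

Subtracting projections against the already-updated `w` (rather than the original `v`) is the "modified" variant, which is numerically more stable than textbook classical Gram-Schmidt.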
Open Source Code Yes Source code and model weights are available at https://github.com/KatarinaYuan/StructTokenBench.
Open Datasets Yes We collected datasets from various resources: ATLAS (Vander Meersche et al., 2024), InterPro (Blum et al., 2024), BioLiP2 (Zhang et al., 2024b), ProteinShake (Kucera et al., 2024), ProteinGLUE (Capel et al., 2022), TAPE (Rao et al., 2019), Fold-Switching (Chakravarty & Porter, 2022), Apo-Holo (Saldaño et al., 2022), CAMEO (Robin et al., 2021), and CASP14 (Kryshtafovych et al., 2021) (see App. A.1.1).
Dataset Splits Yes We split the data at a ratio of 90%/10% for the training and validation sets. ... For supervised tasks evaluating downstream effectiveness, datasets are split using a remote-homology method (see App. A.2) to assess out-of-distribution generalization, which results in two test splits: fold (Fold) and superfamily (SupFam). ... For each fold, superfamilies are split into two groups (60% for training and 40% for testing), creating the fold test split. For the split training data, 80% of the proteins in each superfamily are placed in training, with the remaining 20% in testing, creating superfamily-level datasets. Lastly, 20% of the test data is randomly selected to form a validation set.
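The superfamily-level split described above (80% of each superfamily's proteins to training, the rest to testing, then 20% of the test pool carved out for validation) could be sketched as follows. This is a minimal reconstruction from the quoted description, not the authors' code; `superfamily_split` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict

def superfamily_split(protein_to_supfam, train_frac=0.8,
                      val_of_test_frac=0.2, seed=1234):
    """Group-aware split: keep train_frac of each superfamily for training,
    pool the remainder as test, then move val_of_test_frac of the test
    pool into a validation set."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for prot, sf in protein_to_supfam.items():
        groups[sf].append(prot)
    train, test = [], []
    for prots in groups.values():
        rng.shuffle(prots)
        cut = int(len(prots) * train_frac)   # 80% of this superfamily
        train.extend(prots[:cut])
        test.extend(prots[cut:])
    rng.shuffle(test)
    n_val = int(len(test) * val_of_test_frac)  # 20% of the test pool
    val, test = test[:n_val], test[n_val:]
    return train, val, test
```

Splitting within each superfamily (rather than globally) keeps every superfamily represented in training while still holding out unseen proteins per group.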
Hardware Specification Yes Adam optimizer (Kingma, 2014) (learning rate: 0.0001, weight decay: 0.01, warmup steps: 5,426) was used to train AminoAseed for 108,530 steps on 8 NVIDIA A100 GPUs (more details in App. F.1).
Software Dependencies No The paper mentions software components such as "Adam optimizer (Kingma, 2014)", "DeepSpeed ZeRO training stage 2 (Rajbhandari et al., 2020)", "Biotite (Kunzmann & Hamacher, 2018)", "GROMACS (Abraham et al., 2015)", and "CHARMM36m force field (Huang et al., 2017)". However, it does not provide specific version numbers for these software packages or libraries used in the experiments, as required for a 'Yes' classification.
Experiment Setup Yes Pre-training configurations. Adam optimizer (Kingma, 2014) (learning rate: 0.0001, weight decay: 0.01, warmup steps: 5,426) was used to train AminoAseed for 108,530 steps on 8 NVIDIA A100 GPUs (more details in App. F.1). ... A two-layer MLP probing layer is used for prediction across all tasks, trained using an Adam optimizer for 10,000 steps. During training, the structural representations extracted from PSTs, either continuous or discrete, were fixed. For all models on all tasks, we chose the checkpoint with the best learning rate based on validation set performance. ... we employed the same configuration for both of our implemented models, AminoAseed and VanillaVQ. Specifically, our models were trained using an Adam optimizer with a linear warmup schedule to a peak learning rate of 0.0001, followed by cosine decay to 10% of the peak learning rate. We use a weight decay of 0.01. The training process involved 5,426 warmup steps and continued for a total of 108,530 steps... Each GPU processed a batch size of 4, without gradient accumulation, resulting in an effective global batch size of 32. ... For supervised tasks, ... a two-layer MLP with a hidden dimension of 512, ReLU nonlinearity, and a dropout layer with a dropout ratio of 0.1 between the layers. ... Training was managed using an Adam optimizer with a cosine-annealed learning rate schedule, selecting peak learning rates from the set {0.1, 0.01, 0.001, 1e-4, 5e-5, 1e-5, 5e-5, 1e-6}. The best learning rate was chosen based on the best validation Macro F1 for classification tasks, and the best validation Spearman's ρ for regression tasks. The training protocol included 200 warmup steps and a total of 10,000 training steps. Each experiment was conducted on a single NVIDIA A10 GPU, with a per-GPU batch size of 8 for all supervised tasks, except for the Homo task, which used a batch size of 64. ... All results reported were obtained using seed 1,234.
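The pre-training schedule quoted above (linear warmup to a peak learning rate, then cosine decay down to 10% of the peak) can be written as a small step-to-rate function. This is an illustrative sketch using the paper's reported hyperparameters as defaults, not the authors' training code; the function name `lr_at_step` is hypothetical.

```python
import math

def lr_at_step(step, peak_lr=1e-4, warmup_steps=5_426,
               total_steps=108_530, floor_frac=0.10):
    """Linear warmup to peak_lr, then cosine decay to floor_frac * peak_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1 -> 0
    floor = floor_frac * peak_lr
    return floor + (peak_lr - floor) * cosine
```

At step 0 the rate is 0, it reaches the peak of 1e-4 at step 5,426, and it decays to 1e-5 (10% of peak) by step 108,530.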