A Kernelized Stein Discrepancy for Biological Sequences

Authors: Alan Nawzad Amin, Eli N. Weinstein, Debora Susan Marks

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the advantages of the KSD-B on problems with synthetic and real data, and apply it to measure the fit of state-of-the-art machine learning models. Overall, the KSD-B enables rigorous evaluation of generative biological sequence models, allowing the accuracy of models, sampling procedures, and library designs to be checked reliably.
Researcher Affiliation | Academia | Harvard Medical School; Columbia University; Broad Institute of Harvard and MIT.
Pseudocode | No | The paper describes methods algorithmically but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The supplementary code (https://github.com/AlanNawzadAmin/KSD-B/) provides a Jupyter notebook (KSD-B theory example.ipynb) recreating Fig. 1(a) and 1(b) using the IMQ-H (U), IMQ-H (N), and IMQ-H+Exp-H kernels.
Open Datasets | Yes | First we downloaded 115 thousand CDR3 protein sequences varying in length from 10 to 27 from patient 1 from 10x Genomics (2022).
Dataset Splits | No | The paper mentions holding out 20% of the data for testing, but it does not specify a validation split (percentages, counts, or a cross-validation scheme), so the partitioning used for development and tuning cannot be reproduced.
Hardware Specification | No | No specific hardware details (GPU/CPU models or memory specifications) used for running the experiments are provided in the paper.
Software Dependencies | No | The paper mentions software components and repositories such as "pyro-ppl/pyro", "debbiemarkslab/plmc", and Jupyter notebooks, but it does not provide version numbers for this software or its underlying dependencies, which would be needed to reproduce the environment.
Experiment Setup | Yes | In every case we used λ = 1/5 for the Exp-H kernel, β = 1/2 for the IMQ-H kernel, ζ = log |B| and µ = 0.2 for the alignment kernel, and ϵ = 0.2 for the tilting parameter of the alignment kernel. We set C = 1 for the IMQ-H kernel when in a vector field kernel and C = 3 when in a scalar field kernel. For embedding kernels, we set the bandwidth parameter σ to be the median distance between rescaled embeddings.
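
To make the reported configuration concrete, below is a minimal Python sketch that collects the quoted hyperparameters and implements the stated median-heuristic choice of the bandwidth σ. The names (kernel_config, median_heuristic_bandwidth, ALPHABET_SIZE) are illustrative assumptions, not the authors' code, and taking |B| to be the protein alphabet size of 20 is likewise an assumption.

```python
import math
import numpy as np

# Assumption: |B| is the size of the sequence alphabet (20 for proteins).
ALPHABET_SIZE = 20

# Hyperparameter values as quoted in the Experiment Setup row above;
# the dictionary layout and key names are illustrative only.
kernel_config = {
    "exp_h": {"lam": 1 / 5},                      # lambda for the Exp-H kernel
    "imq_h": {
        "beta": 1 / 2,                            # beta for the IMQ-H kernel
        "C_vector_field": 1,                      # C when used in a vector field kernel
        "C_scalar_field": 3,                      # C when used in a scalar field kernel
    },
    "alignment": {
        "zeta": math.log(ALPHABET_SIZE),          # zeta = log |B|
        "mu": 0.2,
        "epsilon": 0.2,                           # tilting parameter
    },
}

def median_heuristic_bandwidth(embeddings: np.ndarray) -> float:
    """Set sigma to the median pairwise distance between rescaled embeddings.

    `embeddings` is an (n, d) array of already-rescaled sequence embeddings;
    how the rescaling is performed follows the paper and is not reproduced here.
    """
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Median over distinct pairs only (upper triangle, diagonal excluded).
    iu = np.triu_indices(len(embeddings), k=1)
    return float(np.median(dists[iu]))
```

Restricting the median to distinct pairs avoids the zero self-distances on the diagonal biasing σ downward.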