A Kernelized Stein Discrepancy for Biological Sequences

Authors: Alan Nawzad Amin, Eli N. Weinstein, Debora Susan Marks

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We demonstrate the advantages of the KSD-B on problems with synthetic and real data, and apply it to measure the fit of state-of-the-art machine learning models. Overall, the KSD-B enables rigorous evaluation of generative biological sequence models, allowing the accuracy of models, sampling procedures, and library designs to be checked reliably.
Researcher Affiliation | Academia | Harvard Medical School; Columbia University; Broad Institute of Harvard and MIT.
Pseudocode | No | The paper describes methods algorithmically but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The supplementary code (https://github.com/AlanNawzadAmin/KSD-B/) provides a Jupyter notebook (KSD-B theory example.ipynb) recreating Fig. 1(a) and 1(b) using the IMQ-H (U), IMQ-H (N), and IMQ-H+Exp-H kernels.
Open Datasets | Yes | First we downloaded 115 thousand CDR3 protein sequences varying in length from 10 to 27 from patient 1 from 10x Genomics (2022).
Dataset Splits | No | The paper mentions holding out 20% of the data for testing, but it does not specify a validation split (percentages, counts, or a cross-validation scheme), so the partitioning used for development and tuning cannot be reproduced.
Hardware Specification | No | No specific hardware details (GPU/CPU models or memory specifications) used for running the experiments are provided in the paper.
Software Dependencies | No | The paper mentions software components and repositories such as "pyro-ppl/pyro", "debbiemarkslab/plmc", and Jupyter notebooks, but it does not provide version numbers for this software or its underlying dependencies, which would be needed to reproduce the environment.
Experiment Setup | Yes | In every case we used λ = 1/5 for the Exp-H kernel, β = 1/2 for the IMQ-H kernel, ζ = log |B| and µ = 0.2 for the alignment kernel, and ϵ = 0.2 for the tilting parameter of the alignment kernel. We set C = 1 for the IMQ-H kernel when in a vector field kernel and C = 3 when in a scalar field kernel. For embedding kernels, we set the bandwidth parameter σ to be the median distance between rescaled embeddings.
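
To make the reported configuration concrete, below is a minimal Python sketch that collects the quoted hyperparameters and implements the stated median-heuristic choice of the bandwidth σ. The names (kernel_config, median_heuristic_bandwidth, ALPHABET_SIZE) are illustrative assumptions, not the authors' code, and taking |B| to be the protein alphabet size of 20 is likewise an assumption.

```python
import math
import numpy as np

# Assumption: |B| is the size of the sequence alphabet (20 for proteins).
ALPHABET_SIZE = 20

# Hyperparameter values as quoted in the Experiment Setup row above;
# the dictionary layout and key names are illustrative only.
kernel_config = {
    "exp_h": {"lam": 1 / 5},                      # lambda for the Exp-H kernel
    "imq_h": {
        "beta": 1 / 2,                            # beta for the IMQ-H kernel
        "C_vector_field": 1,                      # C when used in a vector field kernel
        "C_scalar_field": 3,                      # C when used in a scalar field kernel
    },
    "alignment": {
        "zeta": math.log(ALPHABET_SIZE),          # zeta = log |B|
        "mu": 0.2,
        "epsilon": 0.2,                           # tilting parameter
    },
}

def median_heuristic_bandwidth(embeddings: np.ndarray) -> float:
    """Set sigma to the median pairwise distance between rescaled embeddings.

    `embeddings` is an (n, d) array of already-rescaled sequence embeddings;
    how the rescaling is performed follows the paper and is not reproduced here.
    """
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Median over distinct pairs only (upper triangle, diagonal excluded).
    iu = np.triu_indices(len(embeddings), k=1)
    return float(np.median(dists[iu]))
```

Restricting the median to distinct pairs avoids the zero self-distances on the diagonal biasing σ downward.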