A Kernelized Stein Discrepancy for Biological Sequences
Authors: Alan Nawzad Amin, Eli N Weinstein, Debora Susan Marks
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the advantages of the KSD-B on problems with synthetic and real data, and apply it to measure the fit of state-of-the-art machine learning models. Overall, the KSD-B enables rigorous evaluation of generative biological sequence models, allowing the accuracy of models, sampling procedures, and library designs to be checked reliably. |
| Researcher Affiliation | Academia | ¹Harvard Medical School, ²Columbia University, ³Broad Institute of Harvard and MIT. |
| Pseudocode | No | The paper describes methods algorithmically but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The supplementary code (https://github.com/AlanNawzadAmin/KSD-B/) provides a Jupyter notebook (KSD-B theory example.ipynb) recreating Fig. 1(a) and 1(b) using the IMQ-H (U), IMQ-H (N), and IMQ-H+Exp-H kernels. |
| Open Datasets | Yes | First we downloaded 115 thousand CDR3 protein sequences varying in length from 10 to 27 from patient 1 from 10x Genomics (2022). |
| Dataset Splits | No | The paper mentions holding out 20% of data for testing, but it does not provide specific details for a validation split (percentages, counts, or cross-validation scheme) to reproduce data partitioning for development and tuning. |
| Hardware Specification | No | No specific hardware details (such as GPU/CPU models, processor types, or memory specifications) used for running experiments were provided in the paper. |
| Software Dependencies | No | The paper mentions various software components and repositories like "pyro-ppl/pyro", "debbiemarkslab/plmc", and "Jupyter notebook", but it does not provide specific version numbers for these software or their underlying dependencies to ensure reproducibility. |
| Experiment Setup | Yes | In every case we used λ = 1/5 for the Exp-H kernel, β = 1/2 for the IMQ-H kernel, ζ = log |B| and µ = 0.2 for the alignment kernel, and ϵ = 0.2 for the tilting parameter of the alignment kernel. We set C = 1 for the IMQ-H kernel when in a vector field kernel and C = 3 when in a scalar field kernel. For embedding kernels, we set the bandwidth parameter σ to be the median distance between rescaled embeddings. |
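
For reference, the hyperparameters quoted in the Experiment Setup row can be collected in one place. The sketch below is illustrative only: the paper does not expose a configuration API, so names such as `ksd_config` and `median_bandwidth` are assumptions, and the alphabet size used for ζ = log |B| is an example value rather than a figure taken from the paper.

```python
# Hedged sketch of the reported KSD-B experiment settings.
# All identifiers here (ksd_config, median_bandwidth) are illustrative,
# not names from the authors' repository.
import numpy as np

ksd_config = {
    "exp_h_lambda": 1 / 5,    # λ for the Exp-H kernel
    "imq_h_beta": 1 / 2,      # β for the IMQ-H kernel
    "imq_h_C_vector": 1,      # C for IMQ-H inside a vector field kernel
    "imq_h_C_scalar": 3,      # C for IMQ-H inside a scalar field kernel
    "align_zeta": np.log(20), # ζ = log|B|; |B| = 20 assumed for protein alphabets
    "align_mu": 0.2,          # µ for the alignment kernel
    "align_epsilon": 0.2,     # ϵ, tilting parameter of the alignment kernel
}

def median_bandwidth(embeddings: np.ndarray) -> float:
    """Set the embedding-kernel bandwidth σ to the median pairwise distance
    between rescaled embeddings, as the experiment setup describes."""
    # pairwise Euclidean distances, upper triangle only (excluding the diagonal)
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(-1))
    iu = np.triu_indices(len(embeddings), k=1)
    return float(np.median(dists[iu]))
```

As a usage note, `median_bandwidth` would be applied to whatever rescaled embedding matrix an experiment produces (shape: number of sequences × embedding dimension) before constructing the embedding kernel.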