Clustering Billions of Reads for DNA Data Storage

Authors: Cyrus Rashtchian, Konstantin Makarychev, Miklos Racz, Siena Ang, Djordje Jevdjic, Sergey Yekhanin, Luis Ceze, Karin Strauss

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We provide empirical justification of the accuracy, scalability, and convergence of our algorithm on real and synthetic data.
Researcher Affiliation | Collaboration | (a) Microsoft Research, (b) CSE at University of Washington, (c) EECS at Northwestern University, (d) ORFE at Princeton University
Pseudocode | Yes | Algorithm 1: Clustering DNA Strands
Open Source Code | No | The paper states, "We plan to release one of our real datasets", which refers to data, not to source code for the methodology.
Open Datasets | Yes | Table 1: Datasets. Real data from Organick et al. [45]. Synthetic data from Defn. 2.3. Appendix E has details.
Dataset Splits | No | The paper does not specify explicit training, validation, or test data splits (e.g., percentages or absolute counts).
Hardware Specification | Yes | We run tests on Microsoft Azure virtual machines (size H16mr: 16 cores, 224 GB RAM, RDMA network).
Software Dependencies | No | The paper mentions "C++ using MPI" and "Starcode [57]" but does not provide specific version numbers for these software components.
Experiment Setup | Yes | For the edit distance threshold, we desire r to be just larger than the cluster diameter. With p noise, we expect the diameter to be at most 4pm with high probability. We conservatively estimate p ≈ 4% for real data, and thus we set r = 25, since 4pm = 24 for p = 0.04 and m = 150. ... On synthetic data, we found that setting θ_low = 40 and θ_high = 60 leads to very reduced running time while sacrificing negligible accuracy. ... Finally, we conservatively set the number of iterations to 780 total (26 communication rounds, each with 30 local iterations) because this led to 99.9% accuracy on synthetic data (even with γ = 1.0).
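
The arithmetic behind the quoted setup values can be checked with a short Python sketch. It is not the authors' code (the paper reports a C++/MPI implementation). The rule "r = diameter bound + 1" is only one reading of "just larger than the cluster diameter". The helper names `levenshtein` and `within_threshold` are hypothetical, and treating "distance at most r" as the comparison is an assumption made here for illustration.

```python
# Values quoted in the Experiment Setup entry.
p = 0.04   # estimated noise rate for real data
m = 150    # strand length used in the quoted calculation

# Diameter bound 4pm; r is taken as the next integer above it (assumption).
diameter_bound = round(4 * p * m)
r = diameter_bound + 1

# Total iterations = communication rounds x local iterations per round.
communication_rounds = 26
local_iterations = 30
total_iterations = communication_rounds * local_iterations


def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def within_threshold(a: str, b: str, threshold: int) -> bool:
    """Hypothetical helper: True if two reads are within the edit distance threshold."""
    return levenshtein(a, b) <= threshold


print(diameter_bound, r, total_iterations)
```

With the quoted values, the bound evaluates to 24, the threshold to 25, and the iteration count to 780, matching the figures in the table.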