Clustering Billions of Reads for DNA Data Storage
Authors: Cyrus Rashtchian, Konstantin Makarychev, Miklos Racz, Siena Ang, Djordje Jevdjic, Sergey Yekhanin, Luis Ceze, Karin Strauss
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We provide empirical justification of the accuracy, scalability, and convergence of our algorithm on real and synthetic data. |
| Researcher Affiliation | Collaboration | (a) Microsoft Research; (b) CSE at University of Washington; (c) EECS at Northwestern University; (d) ORFE at Princeton University |
| Pseudocode | Yes | Algorithm 1: Clustering DNA Strands |
| Open Source Code | No | The paper states, "We plan to release one of our real datasets" which refers to data, not the source code for the methodology. |
| Open Datasets | Yes | Table 1: Datasets. Real data from Organick et al. [45]. Synthetic data from Defn. 2.3. Appendix E has details. |
| Dataset Splits | No | The paper does not specify explicit training, validation, or test data splits (e.g., percentages or absolute counts). |
| Hardware Specification | Yes | We run tests on Microsoft Azure virtual machines (size H16mr: 16 cores, 224 GB RAM, RDMA network). |
| Software Dependencies | No | The paper mentions "C++ using MPI" and "Starcode [57]" but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | For the edit distance threshold, we desire r to be just larger than the cluster diameter. With p noise, we expect the diameter to be at most 4pm with high probability. We conservatively estimate p ≈ 4% for real data, and thus we set r = 25, since 4pm = 24 for p = 0.04 and m = 150. ... On synthetic data, we found that setting θ_low = 40 and θ_high = 60 leads to very reduced running time while sacrificing negligible accuracy. ... Finally, we conservatively set the number of iterations to 780 total (26 communication rounds, each with 30 local iterations) because this led to 99.9% accuracy on synthetic data (even with γ = 1.0). |
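
The arithmetic in the Experiment Setup row can be checked with a short sketch. This is not the authors' code (the paper reports a C++/MPI implementation). The function and constant names below are hypothetical, and reading "just larger than the cluster diameter" as "the smallest integer strictly above 4pm" is an assumption made for illustration. The edit distance shown is the standard Levenshtein dynamic program, included only to show how a threshold r would be applied to a pair of strands.

```python
import math


def diameter_bound(p: float, m: int) -> float:
    """Upper bound 4*p*m on the cluster diameter quoted in the setup row."""
    return 4 * p * m


def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Values taken from the setup row.
P_NOISE = 0.04         # conservative noise estimate for real data
STRAND_LENGTH = 150    # m
ROUNDS = 26            # communication rounds
LOCAL_ITERATIONS = 30  # local iterations per round

bound = diameter_bound(P_NOISE, STRAND_LENGTH)
# Assumption: "just larger" means the next integer strictly above the bound.
# Rounding first guards against floating-point noise in 4 * p * m.
r = math.floor(round(bound, 9)) + 1
total_iterations = ROUNDS * LOCAL_ITERATIONS


def within_threshold(a: str, b: str, threshold: int = r) -> bool:
    """True if two strands are close enough to be merge candidates."""
    return edit_distance(a, b) <= threshold


print(f"4pm = {bound:.1f}, r = {r}, iterations = {total_iterations}")
```

With the quoted values, the bound evaluates to 24, the threshold to 25, and the iteration count to 780, which matches the figures in the table.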