reproducibilityindex.ai

Clustering Billions of Reads for DNA Data Storage

Authors: Cyrus Rashtchian, Konstantin Makarychev, Miklos Racz, Siena Ang, Djordje Jevdjic, Sergey Yekhanin, Luis Ceze, Karin Strauss

NeurIPS 2017

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We provide empirical justification of the accuracy, scalability, and convergence of our algorithm on real and synthetic data.
Researcher Affiliation	Collaboration	(a) Microsoft Research, (b) CSE at University of Washington, (c) EECS at Northwestern University, (d) ORFE at Princeton University
Pseudocode	Yes	Algorithm 1: Clustering DNA Strands
Open Source Code	No	The paper states, "We plan to release one of our real datasets" which refers to data, not the source code for the methodology.
Open Datasets	Yes	Table 1: Datasets. Real data from Organick et al. [45]. Synthetic data from Defn. 2.3. Appendix E has details.
Dataset Splits	No	The paper does not specify explicit training, validation, or test data splits (e.g., percentages or absolute counts).
Hardware Specification	Yes	We run tests on Microsoft Azure virtual machines (size H16mr: 16 cores, 224 GB RAM, RDMA network).
Software Dependencies	No	The paper mentions "C++ using MPI" and "Starcode [57]" but does not provide specific version numbers for these software components.
Experiment Setup	Yes	For the edit distance threshold, we desire r to be just larger than the cluster diameter. With p noise, we expect the diameter to be at most 4pm with high probability. We conservatively estimate p ≈ 4% for real data, and thus we set r = 25, since 4pm = 24 for p = 0.04 and m = 150. ... On synthetic data, we found that setting θlow = 40 and θhigh = 60 leads to very reduced running time while sacrificing negligible accuracy. ... Finally, we conservatively set the number of iterations to 780 total (26 communication rounds, each with 30 local iterations) because this led to 99.9% accuracy on synthetic data (even with γ = 1.0).
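
The arithmetic behind the Experiment Setup values can be checked with a short Python sketch. This is an illustration only, not the authors' code: the function names are hypothetical, and reading "just larger than the cluster diameter" as "the smallest integer strictly greater than 4pm" is an assumption made here.

```python
import math


def edit_distance_threshold(p: float, m: int) -> int:
    """Hypothetical helper: smallest integer strictly greater than 4*p*m.

    4*p*m is the diameter bound quoted in the Experiment Setup row.
    Taking "just larger" to mean "next integer above" is an assumption.
    """
    # Round first so floating-point error cannot push 24.0 to 23.999...
    diameter_bound = round(4 * p * m, 9)
    return math.floor(diameter_bound) + 1


def total_iterations(communication_rounds: int, local_iterations: int) -> int:
    """Hypothetical helper: each communication round runs a fixed number of local iterations."""
    return communication_rounds * local_iterations


# Values quoted in the table: p = 0.04, m = 150, 26 rounds of 30 local iterations.
r = edit_distance_threshold(p=0.04, m=150)
iters = total_iterations(communication_rounds=26, local_iterations=30)
print(f"diameter bound 4pm = {4 * 0.04 * 150:.0f}, threshold r = {r}")
print(f"total iterations = {iters}")
```

With the quoted values, the bound 4pm is 24, so the threshold comes out as r = 25, and 26 rounds of 30 local iterations give 780 iterations, matching the figures in the row above.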