One-Pass Diversified Sampling with Application to Terabyte-Scale Genomic Sequence Streams
Authors: Benjamin Coleman, Benito Geordie, Li Chou, R. A. Leo Elworth, Todd Treangen, Anshumali Shrivastava
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We apply this method to several genomic data analysis tasks and demonstrate significant speedup in downstream analysis without sacrificing the quality of the results. |
| Researcher Affiliation | Collaboration | ¹Department of Electrical and Computer Engineering, Rice University, Houston TX, USA; ²Department of Computer Science, Rice University, Houston, TX, USA; ³Department of Engineering and Computer Science, West Texas A&M University, Canyon TX, USA; ⁴Third AI, Houston TX, USA. |
| Pseudocode | Yes | Algorithm 1 Diversified Sampling |
| Open Source Code | Yes | The diversified sampling algorithm is available as a web app that runs in the browser and as an open-source command line tool. To avoid violating double-blind review, we have included the repository as a zip file and deployed the web app to the (anonymous) URL: Lc28kXQtqH.github.io. |
| Open Datasets | Yes | We downloaded real datasets directly from the SRA, ENA, and UniRef archives... Table 2 shows the run accession numbers and properties of the datasets used in our evaluation... We also compare against the official Diginorm release with default settings (Crusoe et al., 2015)... We used the Kraken2 tool (Wood et al., 2019) to annotate each sequence... We focused on human-host associated microbiomes from the HMP2 project (Peterson et al., 2009). |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with specific percentages or counts, or refer to standard splits. |
| Hardware Specification | Yes | Server 1 CPU 56 cores Intel Xeon E5-2660 v4 @ 2.00GHz... Server 2 CPU 64 cores Intel Xeon Gold 5218 @ 2.30GHz... GPU ASPEED Graphics Family (rev 30)... Memory (GB) 528.27... 394.86 |
| Software Dependencies | No | The paper mentions software tools like Kraken2, Trimmomatic, Mash Screen, and Diginorm, but generally does not provide specific version numbers for these tools. |
| Experiment Setup | Yes | Our algorithm requires five hyperparameters: τ, R, B, k, and n. We use R = 10 RACE repetitions for all tasks. We use B = 1M except for on short-read datasets, where we use B = 100K... k is the length of the k-mers... We use k = 8 for UniRef, k = 18 for short-read metagenomes and k = 22 for long-read metagenomes... n is the number of LSH concatenations... we simply use n = 1... We sweep τ from 0.1 to 100, C from 1 to 20 and M from 1 to the number of reads. |
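
The hyperparameter choices quoted in the Experiment Setup row can be collected into a small configuration helper. The following is only a sketch for readers who want to reproduce the settings: the names `SamplerConfig`, `config_for`, `kmers`, and the dataset labels `"uniref"`, `"short_read"`, and `"long_read"` are hypothetical and do not come from the paper or its command line tool. Only the numeric values are taken from the excerpt above, and τ is left as a required argument because the excerpt gives a sweep range (0.1 to 100) rather than a single value.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplerConfig:
    """Hypothetical container for the five hyperparameters named in the excerpt."""
    tau: float          # swept from 0.1 to 100 in the reported experiments
    k: int              # k-mer length
    B: int = 1_000_000  # B = 1M, except on short-read datasets
    R: int = 10         # RACE repetitions, used for all tasks
    n: int = 1          # number of LSH concatenations


# k per dataset type, as quoted in the Experiment Setup row.
# The dictionary keys are illustrative labels, not names from the paper.
_K_BY_DATASET = {"uniref": 8, "short_read": 18, "long_read": 22}


def config_for(dataset_kind: str, tau: float) -> SamplerConfig:
    """Return the reported settings for a dataset type and a chosen tau."""
    if dataset_kind not in _K_BY_DATASET:
        raise ValueError(f"unknown dataset kind: {dataset_kind!r}")
    # Short-read datasets use B = 100K; everything else uses B = 1M.
    B = 100_000 if dataset_kind == "short_read" else 1_000_000
    return SamplerConfig(tau=tau, k=_K_BY_DATASET[dataset_kind], B=B)


def kmers(sequence: str, k: int) -> list[str]:
    """All contiguous substrings of length k, to illustrate what k controls."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    print(config_for("short_read", tau=1.0))
    print(kmers("ACGTAC", 3))
```

Keeping τ as an explicit argument mirrors the fact that it is the one hyperparameter the authors sweep, while R and n are fixed across all tasks.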