Neural Distance Embeddings for Biological Sequences
Authors: Gabriele Corso, Zhitao Ying, Michal Pándy, Petar Veličković, Jure Leskovec, Pietro Liò
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The capacity of the framework and the significance of these improvements are then demonstrated devising supervised and unsupervised NeuroSEED approaches to multiple core tasks in bioinformatics. Benchmarked with common baselines, the proposed approaches display significant accuracy and/or runtime improvements on real-world datasets. |
| Researcher Affiliation | Collaboration | Gabriele Corso (MIT), Rex Ying (Stanford University), Michal Pándy (University of Cambridge), Petar Veličković (DeepMind), Jure Leskovec (Stanford University), Pietro Liò (University of Cambridge) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code, datasets and tuned hyperparameters can be found at https://github.com/gcorso/NeuroSEED. |
| Open Datasets | Yes | Datasets To evaluate the heuristics we chose three datasets containing different portions of the 16S rRNA gene, crucial in microbiome analysis [21], one of the most promising applications of our approach. The first, Qiita [21], contains more than 6M sequences of up to 152 bp that cover the V4 hyper-variable region. The second, RT988 [11], has only 6.7k publicly available sequences of length up to 465 bp covering the V3-V4 regions. Both datasets were generated by Illumina MiSeq [22] and contain sequences of approximately the same length. Qiita was collected from skin, saliva and faeces samples, while RT988 was from oral plaques. The third dataset is the Greengenes full-length 16S rRNA database [23], which contains more than 1M sequences of lengths between 1,111 and 2,368 bp. Moreover, we used a dataset of synthetically generated sequences to test the importance of data-dependent approaches. A full description of the data splits for each of the tasks is provided in Appendix B.4. |
| Dataset Splits | Yes | A full description of the data splits for each of the tasks is provided in Appendix B.4. |
| Hardware Specification | Yes | The training and inference time comparisons provided in the Greengenes dataset are shown for a set of 5k sequences and were all run on a CPU (Intel Core i7) with the exception of the neural models training that was run on GPU (GeForce GTX TITAN Xp). |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Code, datasets and tuned hyperparameters can be found at https://github.com/gcorso/NeuroSEED. All neural models have an embedding space dimension of 128. During training, Gaussian noise is added to the embedded point in the latent space forcing the decoder to be robust to points not directly produced by the encoder. |
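
For readers who want to sanity-check the setup quoted in the last row, below is a minimal PyTorch sketch of the NeuroSEED training idea: an encoder embeds sequences into a 128-dimensional space (the dimension quoted above) so that distances between embeddings approximate edit distance, with Gaussian noise injected in the latent space during training so that downstream decoders are robust to points not directly produced by the encoder. The encoder architecture, `NOISE_STD`, the Euclidean distance choice, the learning rate, and the placeholder data are all illustrative assumptions, not the authors' implementation; the actual models and tuned hyperparameters are in the linked repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM = 128   # embedding dimension quoted in the Experiment Setup row
NOISE_STD = 0.1   # assumed value; the paper's tuned hyperparameters are in the repo

class Encoder(nn.Module):
    """Hypothetical linear encoder mapping one-hot sequences to R^128."""

    def __init__(self, seq_len: int, vocab_size: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(seq_len * vocab_size, EMBED_DIM),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        z = self.net(x)
        if self.training:
            # Gaussian noise in the latent space, as described in the setup:
            # decoders must handle points not directly produced by the encoder.
            z = z + NOISE_STD * torch.randn_like(z)
        return z

def distance_loss(z1: torch.Tensor, z2: torch.Tensor,
                  target_dist: torch.Tensor) -> torch.Tensor:
    # Train the Euclidean distance between embeddings to match the
    # (normalised) edit distance between the two sequences.
    pred = torch.norm(z1 - z2, dim=-1)
    return F.mse_loss(pred, target_dist)

# Usage with placeholder data: pairs of one-hot sequences of length 152
# (e.g. Qiita V4 reads) and precomputed pairwise edit distances.
enc = Encoder(seq_len=152)
opt = torch.optim.Adam(enc.parameters(), lr=1e-3)
x1, x2 = torch.randn(8, 152, 4), torch.randn(8, 152, 4)
target = torch.rand(8)  # placeholder normalised edit distances
loss = distance_loss(enc(x1), enc(x2), target)
loss.backward()
opt.step()
```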