Neural Distance Embeddings for Biological Sequences

Authors: Gabriele Corso, Zhitao Ying, Michal Pándy, Petar Veličković, Jure Leskovec, Pietro Liò

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The capacity of the framework and the significance of these improvements are then demonstrated by devising supervised and unsupervised NeuroSEED approaches to multiple core tasks in bioinformatics. Benchmarked against common baselines, the proposed approaches display significant accuracy and/or runtime improvements on real-world datasets.
Researcher Affiliation | Collaboration | Gabriele Corso (MIT), Rex Ying (Stanford University), Michal Pándy (University of Cambridge), Petar Veličković (DeepMind), Jure Leskovec (Stanford University), Pietro Liò (University of Cambridge)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code, datasets and tuned hyperparameters can be found at https://github.com/gcorso/NeuroSEED.
Open Datasets | Yes | To evaluate the heuristics we chose three datasets containing different portions of the 16S rRNA gene, crucial in microbiome analysis [21], one of the most promising applications of our approach. The first, Qiita [21], contains more than 6M sequences of up to 152 bp that cover the V4 hyper-variable region. The second, RT988 [11], has only 6.7k publicly available sequences of length up to 465 bp covering the V3-V4 regions. Both datasets were generated by Illumina MiSeq [22] and contain sequences of approximately the same length. Qiita was collected from skin, saliva and faeces samples, while RT988 was from oral plaques. The third dataset is the Greengenes full-length 16S rRNA database [23], which contains more than 1M sequences of length between 1,111 and 2,368 bp. Moreover, we used a dataset of synthetically generated sequences to test the importance of data-dependent approaches. A full description of the data splits for each of the tasks is provided in Appendix B.4.
Dataset Splits | Yes | A full description of the data splits for each of the tasks is provided in Appendix B.4.
Hardware Specification | Yes | The training and inference time comparisons provided for the Greengenes dataset are shown for a set of 5k sequences and were all run on a CPU (Intel Core i7), with the exception of the neural models' training, which was run on a GPU (GeForce GTX TITAN Xp).
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | Code, datasets and tuned hyperparameters can be found at https://github.com/gcorso/NeuroSEED. All neural models have an embedding space dimension of 128. During training, Gaussian noise is added to the embedded point in the latent space, forcing the decoder to be robust to points not directly produced by the encoder.
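The latent-noise regularization quoted above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the noise scale `sigma` and the batch shapes are assumptions for the example, while the 128-dimensional embedding space matches the setup described in the paper.

```python
import numpy as np

EMBED_DIM = 128  # embedding dimension used by all neural models in the paper

def add_latent_noise(z, sigma=0.1, rng=None):
    """Perturb embedded points with isotropic Gaussian noise.

    Applied during training only, so the decoder learns to be robust
    to latent points not directly produced by the encoder.
    `sigma` is a hypothetical value; the paper's tuned hyperparameters
    are in the linked repository.
    """
    rng = rng or np.random.default_rng()
    return z + rng.normal(0.0, sigma, size=z.shape)

# Toy usage: a batch of 4 sequences already embedded by some encoder.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, EMBED_DIM))
z_noisy = add_latent_noise(z, sigma=0.1, rng=rng)
```

At inference time the noise is omitted and the decoder receives the encoder's output directly.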