Neural Distance Embeddings for Biological Sequences

Authors: Gabriele Corso, Zhitao Ying, Michal Pándy, Petar Veličković, Jure Leskovec, Pietro Liò

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The capacity of the framework and the significance of these improvements are then demonstrated by devising supervised and unsupervised NeuroSEED approaches to multiple core tasks in bioinformatics. Benchmarked against common baselines, the proposed approaches display significant accuracy and/or runtime improvements on real-world datasets.
Researcher Affiliation | Collaboration | Gabriele Corso (MIT), Rex Ying (Stanford University), Michal Pándy (University of Cambridge), Petar Veličković (DeepMind), Jure Leskovec (Stanford University), Pietro Liò (University of Cambridge)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code, datasets and tuned hyperparameters can be found at https://github.com/gcorso/NeuroSEED.
Open Datasets | Yes | To evaluate the heuristics we chose three datasets containing different portions of the 16S rRNA gene, crucial in microbiome analysis [21], one of the most promising applications of our approach. The first, Qiita [21], contains more than 6M sequences of up to 152 bp that cover the V4 hyper-variable region. The second, RT988 [11], has only 6.7k publicly available sequences of length up to 465 bp covering the V3-V4 regions. Both datasets were generated by Illumina MiSeq [22] and contain sequences of approximately the same length. Qiita was collected from skin, saliva and faeces samples, while RT988 was from oral plaques. The third dataset is the Greengenes full-length 16S rRNA database [23], which contains more than 1M sequences of length between 1,111 and 2,368 bp. Moreover, we used a dataset of synthetically generated sequences to test the importance of data-dependent approaches. A full description of the data splits for each of the tasks is provided in Appendix B.4.
Dataset Splits | Yes | A full description of the data splits for each of the tasks is provided in Appendix B.4.
Hardware Specification | Yes | The training and inference time comparisons provided for the Greengenes dataset are shown for a set of 5k sequences and were all run on a CPU (Intel Core i7), with the exception of the neural models' training, which was run on a GPU (GeForce GTX TITAN Xp).
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | Code, datasets and tuned hyperparameters can be found at https://github.com/gcorso/NeuroSEED. All neural models have an embedding space dimension of 128. During training, Gaussian noise is added to the embedded point in the latent space, forcing the decoder to be robust to points not directly produced by the encoder.
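The latent-noise regularization quoted above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the noise scale `sigma` and the batch shapes are assumptions for the example, while the 128-dimensional embedding space matches the setup described in the paper.

```python
import numpy as np

EMBED_DIM = 128  # embedding dimension used by all neural models in the paper

def add_latent_noise(z, sigma=0.1, rng=None):
    """Perturb embedded points with isotropic Gaussian noise.

    Applied during training only, so the decoder learns to be robust
    to latent points not directly produced by the encoder.
    `sigma` is a hypothetical value; the paper's tuned hyperparameters
    are in the linked repository.
    """
    rng = rng or np.random.default_rng()
    return z + rng.normal(0.0, sigma, size=z.shape)

# Toy usage: a batch of 4 sequences already embedded by some encoder.
rng = np.random.default_rng(0)
z = rng.normal(size=(4, EMBED_DIM))
z_noisy = add_latent_noise(z, sigma=0.1, rng=rng)
```

At inference time the noise is omitted and the decoder receives the encoder's output directly.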