Neural Embeddings for kNN Search in Biological Sequence

Authors: Zhihao Chang, Linzhu Yu, Yanchao Xu, Wentao Hu

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments show that our Bio-kNN significantly outperforms the state-of-the-art methods on two large-scale datasets without increasing the training cost."
Researcher Affiliation | Academia | Zhihao Chang (1), Linzhu Yu (2), Yanchao Xu (2), Wentao Hu (3). (1) The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China; (2) College of Computer Science and Technology, Zhejiang University, Hangzhou, China; (3) Zhejiang Police College, Hangzhou, China
Pseudocode | No | The paper describes its methods in prose and mathematical formulas but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and datasets are available at https://github.com/Proudc/Bio-KNN.
Open Datasets | Yes | "We evaluate our neural embeddings through the utilization of two extensively recognized datasets (Dai et al. 2020; Zhang, Yuan, and Indyk 2019), i.e., the Uniprot and Uniref. These datasets exhibit varying sizes and sequence lengths, and their properties are shown in Table 1."
Dataset Splits | Yes | "Consistent with existing works, we partition each dataset into distinct subsets, namely the training set, query set, and base set. Both the training set and the query set are composed of 1,000 sequences, and the other items belong to the base set."
Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments (e.g., GPU models, CPU types, or memory).
Software Dependencies | No | The paper states "We use the EMBOSS to compute the NW distance between sequences" and provides a GitHub link for code and datasets, but it does not specify version numbers for EMBOSS or any other software dependency.
Experiment Setup | No | The paper provides some details, e.g., "set the split interval δ = 100" and the use of "the CNN submodule in CNNED", but it lacks specific hyperparameter values (learning rate, batch size, number of epochs, optimizer settings) and a comprehensive description of the experimental setup in the main text.
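The "Dataset Splits" row quotes the paper's partitioning protocol: 1,000 sequences each for the training and query sets, with the remainder forming the base set. A minimal sketch of that split (the `split_dataset` helper and the fixed seed are illustrative assumptions, not the authors' code):

```python
import random

def split_dataset(sequences, n_train=1000, n_query=1000, seed=42):
    # Illustrative split following the quoted protocol: 1,000 training
    # sequences, 1,000 query sequences, and everything else as the base set.
    rng = random.Random(seed)
    order = list(range(len(sequences)))
    rng.shuffle(order)
    train = [sequences[i] for i in order[:n_train]]
    query = [sequences[i] for i in order[n_train:n_train + n_query]]
    base = [sequences[i] for i in order[n_train + n_query:]]
    return train, query, base
```

The three subsets are disjoint and together cover the whole dataset, matching the "other items belong to the base set" wording.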
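The paper relies on EMBOSS to compute the Needleman-Wunsch (NW) distance between sequences. As a self-contained illustration of what that ground-truth computation does, here is a unit-cost dynamic-programming variant (i.e., classic edit distance); EMBOSS needle itself uses substitution matrices and affine gap penalties, so this is a simplification, not the tool's actual scoring:

```python
def nw_distance(a, b):
    # Needleman-Wunsch-style dynamic programming with unit costs for
    # insertion, deletion, and substitution (classic edit distance).
    # Rows are processed one at a time to keep memory at O(len(b)).
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion from a
                         cur[j - 1] + 1,     # insertion into a
                         prev[j - 1] + cost) # match or substitution
        prev = cur
    return prev[n]
```

For example, `nw_distance("KITTEN", "SITTING")` is 3 (one substitution at each end plus one insertion).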
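The core idea being reproduced is kNN search where distance between learned embeddings stands in for the expensive NW distance between sequences. The sketch below uses a normalized k-mer count vector as a hypothetical stand-in for Bio-kNN's trained CNN encoder (the `kmer_embed` function and the brute-force search are assumptions for illustration; the actual system learns its embedding and would typically use an ANN index at scale):

```python
import numpy as np

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def kmer_embed(seq, k=2):
    # Hypothetical stand-in for the learned neural encoder: an L2-normalized
    # k-mer count vector. Bio-kNN instead trains a CNN-based embedding model.
    idx = {a: i for i, a in enumerate(AMINO)}
    v = np.zeros(len(AMINO) ** k)
    for i in range(len(seq) - k + 1):
        pos, ok = 0, True
        for c in seq[i:i + k]:
            if c not in idx:
                ok = False
                break
            pos = pos * len(AMINO) + idx[c]
        if ok:
            v[pos] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def knn(query_seq, base_seqs, k=3):
    # Brute-force kNN in embedding space: Euclidean distance between
    # embeddings approximates sequence similarity, so no NW computation
    # is needed at query time.
    q = kmer_embed(query_seq)
    dists = [np.linalg.norm(q - kmer_embed(s)) for s in base_seqs]
    return sorted(range(len(base_seqs)), key=lambda i: dists[i])[:k]
```

For instance, querying `"AAAA"` against `["AAAA", "AAAC", "WWWW"]` with `k=2` returns indices `[0, 1]`, since the two alanine-rich sequences share k-mers with the query while `"WWWW"` shares none.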