Neural Embeddings for kNN Search in Biological Sequence

Authors: Zhihao Chang, Linzhu Yu, Yanchao Xu, Wentao Hu

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Extensive experiments show that our Bio-kNN significantly outperforms the state-of-the-art methods on two large-scale datasets without increasing the training cost."
Researcher Affiliation | Academia | Zhihao Chang (1), Linzhu Yu (2), Yanchao Xu (2), Wentao Hu (3). (1) The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China; (2) College of Computer Science and Technology, Zhejiang University, Hangzhou, China; (3) Zhejiang Police College, Hangzhou, China
Pseudocode | No | The paper describes its methods in prose and mathematical formulas but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and datasets are available at https://github.com/Proudc/Bio-KNN.
Open Datasets | Yes | "We evaluate our neural embeddings through the utilization of two extensively recognized datasets (Dai et al. 2020; Zhang, Yuan, and Indyk 2019), i.e., the Uniprot and Uniref. These datasets exhibit varying sizes and sequence lengths, and their properties are shown in Table 1."
Dataset Splits | Yes | "Consistent with existing works, we partition each dataset into distinct subsets, namely the training set, query set, and base set. Both the training set and the query set are composed of 1,000 sequences, and the other items belong to the base set."
Hardware Specification | No | The paper does not provide specific details about the hardware used for the experiments (e.g., GPU models, CPU types, or memory).
Software Dependencies | No | The paper states "We use the EMBOSS to compute the NW distance between sequences" and provides a GitHub link for code and datasets, but it does not specify version numbers for EMBOSS or any other software dependency.
Experiment Setup | No | The paper provides some details, e.g., "set the split interval δ = 100" and the use of "the CNN submodule in CNNED", but it lacks specific hyperparameter values (learning rate, batch size, number of epochs, optimizer settings) and a comprehensive description of the experimental setup in the main text.
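The "Dataset Splits" row quotes the paper's partitioning protocol: 1,000 sequences each for the training and query sets, with the remainder forming the base set. A minimal sketch of that split (the `split_dataset` helper and the fixed seed are illustrative assumptions, not the authors' code):

```python
import random

def split_dataset(sequences, n_train=1000, n_query=1000, seed=42):
    # Illustrative split following the quoted protocol: 1,000 training
    # sequences, 1,000 query sequences, and everything else as the base set.
    rng = random.Random(seed)
    order = list(range(len(sequences)))
    rng.shuffle(order)
    train = [sequences[i] for i in order[:n_train]]
    query = [sequences[i] for i in order[n_train:n_train + n_query]]
    base = [sequences[i] for i in order[n_train + n_query:]]
    return train, query, base
```

The three subsets are disjoint and together cover the whole dataset, matching the "other items belong to the base set" wording.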
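The paper relies on EMBOSS to compute the Needleman-Wunsch (NW) distance between sequences. As a self-contained illustration of what that ground-truth computation does, here is a unit-cost dynamic-programming variant (i.e., classic edit distance); EMBOSS needle itself uses substitution matrices and affine gap penalties, so this is a simplification, not the tool's actual scoring:

```python
def nw_distance(a, b):
    # Needleman-Wunsch-style dynamic programming with unit costs for
    # insertion, deletion, and substitution (classic edit distance).
    # Rows are processed one at a time to keep memory at O(len(b)).
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion from a
                         cur[j - 1] + 1,     # insertion into a
                         prev[j - 1] + cost) # match or substitution
        prev = cur
    return prev[n]
```

For example, `nw_distance("KITTEN", "SITTING")` is 3 (one substitution at each end plus one insertion).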
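The core idea being reproduced is kNN search where distance between learned embeddings stands in for the expensive NW distance between sequences. The sketch below uses a normalized k-mer count vector as a hypothetical stand-in for Bio-kNN's trained CNN encoder (the `kmer_embed` function and the brute-force search are assumptions for illustration; the actual system learns its embedding and would typically use an ANN index at scale):

```python
import numpy as np

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def kmer_embed(seq, k=2):
    # Hypothetical stand-in for the learned neural encoder: an L2-normalized
    # k-mer count vector. Bio-kNN instead trains a CNN-based embedding model.
    idx = {a: i for i, a in enumerate(AMINO)}
    v = np.zeros(len(AMINO) ** k)
    for i in range(len(seq) - k + 1):
        pos, ok = 0, True
        for c in seq[i:i + k]:
            if c not in idx:
                ok = False
                break
            pos = pos * len(AMINO) + idx[c]
        if ok:
            v[pos] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def knn(query_seq, base_seqs, k=3):
    # Brute-force kNN in embedding space: Euclidean distance between
    # embeddings approximates sequence similarity, so no NW computation
    # is needed at query time.
    q = kmer_embed(query_seq)
    dists = [np.linalg.norm(q - kmer_embed(s)) for s in base_seqs]
    return sorted(range(len(base_seqs)), key=lambda i: dists[i])[:k]
```

For instance, querying `"AAAA"` against `["AAAA", "AAAC", "WWWW"]` with `k=2` returns indices `[0, 1]`, since the two alanine-rich sequences share k-mers with the query while `"WWWW"` shares none.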