Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Neural Embeddings for kNN Search in Biological Sequence
Authors: Zhihao Chang, Linzhu Yu, Yanchao Xu, Wentao Hu
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments show that our Bio-kNN significantly outperforms the state-of-the-art methods on two large-scale datasets without increasing the training cost. |
| Researcher Affiliation | Academia | Zhihao Chang¹, Linzhu Yu², Yanchao Xu², Wentao Hu³. ¹The State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China; ²College of Computer Science and Technology, Zhejiang University, Hangzhou, China; ³Zhejiang Police College, Hangzhou, China |
| Pseudocode | No | The paper describes methods in prose and with mathematical formulas but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and datasets are available at https://github.com/Proudc/Bio-KNN. |
| Open Datasets | Yes | We evaluate our neural embeddings through the utilization of two extensively recognized datasets (Dai et al. 2020; Zhang, Yuan, and Indyk 2019), i.e., Uniprot and Uniref. These datasets exhibit varying sizes and sequence lengths, and their properties are shown in Table 1. |
| Dataset Splits | Yes | Consistent with existing works, we partition each dataset into distinct subsets, namely the training set, query set, and base set. Both the training set and the query set are composed of 1,000 sequences, and the other items belong to the base set. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments (e.g., GPU models, CPU types, or memory). |
| Software Dependencies | No | The paper mentions 'We use the EMBOSS to compute the NW distance between sequences' and provides a GitHub link for code and datasets, but it does not specify version numbers for EMBOSS or any other software dependencies. |
| Experiment Setup | No | The paper provides some details like 'set the split interval δ = 100' and mentions using 'the CNN submodule in CNNED', but it lacks specific hyperparameter values (e.g., learning rate, batch size, number of epochs, optimizer settings) or a comprehensive description of the experimental setup in the main text. |
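The dataset split quoted above (1,000 training sequences, 1,000 query sequences, remainder as base set) can be sketched as follows. This is a hypothetical illustration, not the authors' code from the Bio-KNN repository; the function name, shuffling strategy, and seed are assumptions.

```python
import random

def partition_dataset(sequences, train_size=1000, query_size=1000, seed=42):
    """Partition sequences into training, query, and base sets.

    Illustrative sketch of the split described in the paper: 1,000
    sequences each for the training and query sets, with all remaining
    sequences forming the base set. The shuffle and seed are assumptions.
    """
    rng = random.Random(seed)
    shuffled = sequences[:]
    rng.shuffle(shuffled)  # randomize before splitting (assumed step)
    train = shuffled[:train_size]
    query = shuffled[train_size:train_size + query_size]
    base = shuffled[train_size + query_size:]
    return train, query, base

# Toy usage with synthetic sequence identifiers
seqs = [f"seq{i}" for i in range(5000)]
train, query, base = partition_dataset(seqs)
print(len(train), len(query), len(base))
```

For the real Uniprot and Uniref datasets the base set would be far larger, since only 2,000 sequences in total are held out for training and querying.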