Revisiting K-mer Profile for Effective and Scalable Genome Representation Learning

Authors: Abdulkadir Çelikkanat, Andres R. Masegosa, Thomas D. Nielsen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically assess the proposed embeddings on metagenomic binning tasks and compare their performance with large state-of-the-art genome foundation models. Our findings indicate that, while both sets of models recover MAGs of comparable quality, the proposed models require significantly fewer computational resources. Figure 1 demonstrates this by comparing the number of parameters in our k-mer embedding methods with those in the state-of-the-art genome foundation models (for context, a minimal k-mer profile sketch follows the table).
Researcher Affiliation | Academia | Abdulkadir Çelikkanat, Aalborg University, 9000 Aalborg, Denmark (abce@cs.aau.dk); Andres R. Masegosa, Aalborg University, 9000 Aalborg, Denmark (arma@cs.aau.dk); Thomas D. Nielsen, Aalborg University, 9000 Aalborg, Denmark (tdn@cs.aau.dk)
Pseudocode | No | The paper describes its methods using mathematical equations and textual explanations, but it does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The datasets and implementation of the proposed architectures can be found at the following address: https://github.com/abdcelikkanat/revisitingkmers.
Open Datasets | Yes | We utilize the same publicly available datasets used to benchmark the genome foundation models [32]. The training set consists of 2 million pairs of non-overlapping DNA sequences, each 10,000 bases in length, constructed by sampling from a dataset comprising 17,636 viral, 5,011 fungal, and 6,402 distinct bacterial genomes from GenBank [3]. For model evaluation, we use six datasets derived from the CAMI2 challenge data [13], representing marine and plant-associated environments and including fungal genomes.
Dataset Splits | Yes | We assume that the number of clusters (i.e., genomes) in the evaluation datasets is unknown, so we utilize the modified K-medoid algorithm [32] to infer the clusters. The algorithm requires a threshold value; to set it, we use separate datasets derived from the same source as the target evaluation datasets. For each method, we first compute the cluster centroids using the ground-truth labels and the embeddings generated by the respective approach. Then, we calculate the similarities between the centroid vector and the read embeddings within the same cluster. The threshold value is selected as the 70th percentile of these sorted similarity scores. After inferring the clusters with the modified K-medoid algorithm, the extracted cluster labels are aligned with the ground-truth labels by the Hungarian algorithm. The datasets named Dataset 0 are used only to select the optimal threshold value for the K-medoid algorithm, while the other variants (i.e., Dataset 5 and Dataset 6) are used for evaluation (a sketch of this threshold-selection and alignment procedure follows the table).
Hardware Specification | No | Our proposed models were trained on a cluster equipped with various NVIDIA GPU models.
Software Dependencies | No | The paper mentions using the Adam optimizer but does not specify software versions for libraries, frameworks, or other dependencies; for example, it does not state the Python, PyTorch/TensorFlow, or CUDA versions.
Experiment Setup | Yes | For the optimization of our models, we employed the Adam optimizer with a learning rate of 10^-3. We used smaller subsets of the dataset for training our models, sampling 10^4 reads for OURS(POIS) and 10^6 sequences for OURS(NL). The OURS(NL) model was trained for 300 epochs with a mini-batch size of 10^4, while OURS(POIS) was trained for 1000 epochs using full-batch updates and a window size of 4. Since we incorporated the contrastive learning strategy for OURS(NL), we randomly sampled 200 read halves to form the negative instances for each positive sample (one half of a read). We set k = 4 and the final embedding dimension to 256 for all our models (a PyTorch sketch of this setup follows the table).
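For context on the Research Type row: the k-mer profile underlying these lightweight embedding methods is simply the vector of normalized k-mer frequencies of a read. Below is a minimal sketch assuming plain 4-mer counting over the ACGT alphabet; it is not the authors' implementation, which lives in the linked repository.

```python
from collections import Counter
from itertools import product

def kmer_profile(seq: str, k: int = 4) -> list[float]:
    """Normalized k-mer frequency vector of a DNA sequence (4^k entries)."""
    index = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in index) or 1  # skip k-mers with ambiguous bases
    return [counts[m] / total for m in index]

# A 4-mer profile has 4^4 = 256 entries, matching the embedding
# dimension reported in the Experiment Setup row.
assert len(kmer_profile("ACGTACGTAGCTAGCTAACG")) == 256
```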
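The threshold-selection and label-alignment procedure described under Dataset Splits can be summarized in a short sketch. Cosine similarity and the function names are assumptions here; the actual implementation is in the linked repository.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def select_threshold(embeddings: np.ndarray, labels: np.ndarray, pct: float = 70.0) -> float:
    """Threshold = 70th percentile of centroid-to-read similarities (cosine assumed)."""
    sims = []
    for c in np.unique(labels):
        members = embeddings[labels == c]   # read embeddings of one ground-truth cluster
        centroid = members.mean(axis=0)
        sims.extend(
            members @ centroid
            / (np.linalg.norm(members, axis=1) * np.linalg.norm(centroid) + 1e-12)
        )
    return float(np.percentile(sims, pct))

def align_clusters(pred: np.ndarray, truth: np.ndarray) -> dict[int, int]:
    """Match inferred cluster labels to ground-truth labels via the Hungarian algorithm."""
    p_ids, t_ids = np.unique(pred), np.unique(truth)
    overlap = np.array([[np.sum((pred == p) & (truth == t)) for t in t_ids] for p in p_ids])
    rows, cols = linear_sum_assignment(-overlap)  # negate to maximize total overlap
    return {int(p_ids[r]): int(t_ids[c]) for r, c in zip(rows, cols)}
```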
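The Experiment Setup row maps onto a short PyTorch sketch. The embedding table and the InfoNCE-style loss below are hypothetical stand-ins for the paper's models and objective; only the hyperparameter values are taken directly from the text.

```python
import torch
import torch.nn.functional as F

# Hyperparameters reported in the paper.
K, EMBED_DIM = 4, 256
LR = 1e-3                              # Adam learning rate
EPOCHS_NL, BATCH_NL = 300, 10_000      # OURS(NL): mini-batch training on 10^6 sequences
EPOCHS_POIS, WINDOW = 1000, 4          # OURS(POIS): full-batch updates on 10^4 reads
NUM_NEGATIVES = 200                    # negative read halves per positive sample

# Placeholder model: one embedding vector per possible k-mer (4^4 = 256 k-mers).
model = torch.nn.Embedding(4 ** K, EMBED_DIM)
optimizer = torch.optim.Adam(model.parameters(), lr=LR)

def contrastive_loss(anchor: torch.Tensor, positive: torch.Tensor,
                     negatives: torch.Tensor) -> torch.Tensor:
    """InfoNCE-style objective: score one positive read half against sampled negatives.
    anchor/positive: (d,) embeddings; negatives: (NUM_NEGATIVES, d)."""
    logits = torch.cat([(anchor * positive).sum().unsqueeze(0), negatives @ anchor])
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```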