Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.

ChromFound: Towards A Universal Foundation Model for Single-Cell Chromatin Accessibility Data

Authors: Yifeng Jiao, Yuchen Liu, Yu Zhang, Xin Guo, Yushuai Wu, Chen Jiang, Jiyang Li, Hongwei Zhang, LIMEI HAN, Xin Gao, Yuan Qi, Yuan Cheng

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate the effectiveness of ChromFound, we conduct comprehensive evaluations across multiple tasks and datasets. In cell representation, ChromFound outperforms baselines instead of additional training, achieving an average ARI improvement of 17.02% across 8 scATAC-seq datasets from 4 tissues. ... 4 Experiments
Researcher Affiliation | Academia | 1 Artificial Intelligence Innovation and Incubation Institute, Fudan University; 2 Shanghai Academy of Artificial Intelligence for Science; 3 Computer, Electrical and Mathematical Sciences and Engineering Division, KAUST; 4 Center of Excellence for Smart Health, KAUST; 5 Center of Excellence on Gen AI, KAUST; 6 Zhongshan Hospital, Fudan University
Pseudocode | No | The paper describes the overall workflow of the encoder layer in Section 3.2.3 with mathematical equations, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | The implementation of ChromFound is available via https://github.com/JohnsonKlose/ChromFound.
Open Datasets | Yes | We assemble a large-scale scATAC-seq dataset comprising over 2.64 million cells detailed in Table 6. For model pretraining, we select a representative subset of 1.97 million cells from more than 30 distinct organs and tissues. ... The resources of all datasets are detailed in Table 6.
Dataset Splits | Yes | For experimental settings, we divide the training data into 90% for training and 10% for validation. (Section B.2) ... We divide the whole dataset into 80% for training, 10% for evaluation, and 10% for testing. (Section B.3)
Hardware Specification | Yes | ChromFound is trained for 5 epochs over 80 hours on a compute cluster comprising 4 machines with a total of 32 NVIDIA A100 GPUs. (Section 3.3) ... We compute the inference speed of ChromFound on a node equipped with 32 CPU cores, 256 GB of memory, and one NVIDIA A100 GPU. (Section B.1)
Software Dependencies | No | The paper mentions using the AdamW optimizer and a 'compute cluster comprising 4 machines with a total of 32 NVIDIA A100 GPUs' (Section 3.3), which implies deep learning frameworks like PyTorch or TensorFlow, but does not provide specific version numbers for any software dependencies like Python, PyTorch, CUDA, or other libraries.
Experiment Setup | Yes | The effective batch size is set to 128. We use the AdamW optimizer [39] with a maximum learning rate of 5 × 10^-5. The embedding dimension and the hidden size of WPSA D are set to 128, while dimension D_low of the Mamba block is set to 32. The model architecture consists of 4 stacked encoder layers in total.
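
The split ratios and hyperparameters quoted in the Dataset Splits and Experiment Setup rows can be written down as a small sketch. This is not the authors' code: the names `ReportedSetup` and `split_indices` are hypothetical, the shuffled random split and the seed are assumptions (the quoted text does not say how cells are assigned to each part), and the learning rate assumes the reported value is 5 × 10^-5.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ReportedSetup:
    """Hyperparameters quoted in the Experiment Setup row (field names are hypothetical)."""
    effective_batch_size: int = 128
    max_learning_rate: float = 5e-5  # assumption: the reported value is 5 x 10^-5
    embedding_dim: int = 128
    mamba_d_low: int = 32
    num_encoder_layers: int = 4


def split_indices(n_cells, fractions, seed=0):
    """Shuffle range(n_cells) and cut it into consecutive parts sized by `fractions`.

    Assumption: a plain random split; the paper may split differently.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_cells))
    random.Random(seed).shuffle(indices)
    parts, start = [], 0
    for i, frac in enumerate(fractions):
        # The last part takes the remainder so that every index is assigned once.
        is_last = i == len(fractions) - 1
        end = n_cells if is_last else start + round(n_cells * frac)
        parts.append(indices[start:end])
        start = end
    return parts


setup = ReportedSetup()
# 90% / 10% train/validation split, as quoted from Section B.2.
train, val = split_indices(1000, (0.9, 0.1))
# 80% / 10% / 10% train/evaluation/test split, as quoted from Section B.3.
tr, ev, te = split_indices(1000, (0.8, 0.1, 0.1))
print(len(train), len(val), len(tr), len(ev), len(te))
```

Fixing the seed makes the partition repeatable across runs, which is the property a reproducibility check on dataset splits would look for.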