Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Multimodal 3D Genome Pre-training

Authors: Minghao Yang, Pengteng Li, Yan Liang, Qianyi Cai, Zhihang Zheng, Shichen Zhang, Pengfei ZHANG, Zhi-An Huang, Hui Xiong

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that MIX-HIC significantly surpasses existing state-of-the-art methods in diverse downstream tasks. This work provides a valuable resource for advancing 3D genomics research.
Researcher Affiliation | Academia | (1) Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou), China; (2) School of Artificial Intelligence, South China Normal University, China; (3) Thrust of Bioscience and Biomedical Engineering, The Hong Kong University of Science and Technology (Guangzhou), China; (4) Department of Computer Science, City University of Hong Kong (Dongguan), China; (5) Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China
Pseudocode | No | The paper describes the methodology in prose and mathematical equations within Section 4 'Methodology' and its subsections, but does not present any structured pseudocode or algorithm blocks.
Open Source Code | Yes | The source code of MIX-HIC is available at https://github.com/myang998/MIX-HIC.
Open Datasets | Yes | To distill the comprehensive semantics from the 3D genome, we collect and refine a large-scale dataset for pre-training MIX-HIC, using publicly available data from the hg38 assembly. The Hi-C contact maps are obtained from the 4DN Data Portal (https://data.4dnucleome.org/), while the epigenomic tracks (ATAC-seq and DNase-seq, which measure how open or accessible DNA is for transcription), CAGE-seq expression data (which directly quantifies gene activity levels), and CTCF ChIA-PET [29] chromatin loops (which identify high-confidence interactions mediated by the key architectural protein CTCF) are downloaded from the ENCODE Portal (https://www.encodeproject.org/).
Dataset Splits | Yes | Following previous work [10, 6, 5], we partition the chromosomes into distinct training, validation, and test sets. Specifically, chromosomes 10 and 11 serve as the validation set, chromosomes 3, 13, and 17 as the test set, and the remaining chromosomes are used for model training across three downstream tasks.
Hardware Specification | Yes | MIX-HIC is developed using Python and PyTorch, and executed on the Ubuntu platform with a Tesla A100 GPU.
Software Dependencies | No | MIX-HIC is developed using Python and PyTorch, and executed on the Ubuntu platform with a Tesla A100 GPU. Table 12 lists the general software used (Hugging Face, scikit-learn, NumPy, PyTorch, and Matplotlib) but does not specify version numbers.
Experiment Setup | Yes | During the pre-training stage, MIX-HIC is configured with 500 epochs, a learning rate of 1e-5, and a batch size of 256. For the CAGE-seq expression prediction task, the predefined feature dimension C and learning rate are set to 256 and 1e-4, respectively. For the prediction of Hi-C contact maps and chromatin loops, these parameters are specified as 128 and 1e-5, respectively. The number of transformer blocks T in each encoder, contact map-grounded fusion block, and decoder is set to 2. Fine-tuning is conducted with a batch size of 64, utilizing the AdamW optimizer [45] with the momentum parameters β1 and β2 initialized to 0.9 and 0.999, respectively. The fine-tuning process is configured with a maximum of 200 epochs, and an early stopping strategy is employed with a patience parameter of 20 to prevent overfitting.
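
The chromosome-level partition quoted under Dataset Splits can be sketched in a few lines of Python. This is an illustrative sketch, not code from the MIX-HIC repository: the names `assign_split` and `split_chromosomes` are hypothetical, and the pool of autosomes chr1 to chr22 is an assumption, since the quoted text does not say whether sex chromosomes are included.

```python
# Chromosome-level split as quoted in the Dataset Splits row.
# ASSUMPTION: the candidate pool is the autosomes chr1-chr22; the quoted
# text does not state whether chrX/chrY are used.
VALIDATION_CHROMS = {"chr10", "chr11"}
TEST_CHROMS = {"chr3", "chr13", "chr17"}
ALL_CHROMS = [f"chr{i}" for i in range(1, 23)]


def assign_split(chrom: str) -> str:
    """Return 'val', 'test', or 'train' for a chromosome name."""
    if chrom in VALIDATION_CHROMS:
        return "val"
    if chrom in TEST_CHROMS:
        return "test"
    return "train"


def split_chromosomes(chroms=ALL_CHROMS):
    """Group chromosome names by split, preserving input order."""
    splits = {"train": [], "val": [], "test": []}
    for chrom in chroms:
        splits[assign_split(chrom)].append(chrom)
    return splits


if __name__ == "__main__":
    for name, members in split_chromosomes().items():
        print(name, members)
```

Splitting by whole chromosome, rather than by random genomic window, keeps neighbouring and overlapping regions out of different sets, which is presumably why the quoted passage follows earlier work in doing so.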
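
The fine-tuning schedule quoted under Experiment Setup (at most 200 epochs, early stopping with a patience of 20) can be illustrated with a small, framework-free sketch. The class `FineTuneConfig` and the function `run_with_early_stopping` are hypothetical names, and monitoring a validation loss where lower is better is an assumption; the quoted text does not say which quantity is monitored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Fine-tuning values quoted in the Experiment Setup row."""
    batch_size: int = 64
    max_epochs: int = 200
    patience: int = 20
    beta1: float = 0.9            # AdamW momentum parameter
    beta2: float = 0.999          # AdamW momentum parameter
    learning_rate: float = 1e-5   # 1e-4 for CAGE-seq expression prediction
    feature_dim: int = 128        # 256 for CAGE-seq expression prediction
    num_blocks: int = 2           # transformer blocks T per module


def run_with_early_stopping(val_losses, config):
    """Replay a list of per-epoch validation losses.

    ASSUMPTION: lower loss is better and only a strict improvement
    resets the patience counter.
    Returns (stop_epoch, best_epoch), both 1-indexed.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    capped = list(val_losses)[: config.max_epochs]
    for epoch, loss in enumerate(capped, start=1):
        if loss < best_loss:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= config.patience:
                return epoch, best_epoch
    return len(capped), best_epoch


if __name__ == "__main__":
    # Loss improves for two epochs, then plateaus.
    losses = [1.0, 0.5] + [0.6] * 50
    print(run_with_early_stopping(losses, FineTuneConfig()))
```

In an actual training loop the replayed list would be replaced by a validation pass after each epoch, with the model checkpoint from `best_epoch` retained.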