Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Learning to Discover Regulatory Elements for Gene Expression Prediction

Authors: Xingyu Su, Haiyang Yu, Degui Zhi, Shuiwang Ji

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that Seq2Exp outperforms existing baselines in gene expression prediction tasks and discovers influential regions compared to commonly used statistical methods for peak detection such as MACS3. The source code is released as part of the AIRS library (https://github.com/divelab/AIRS/).
Researcher Affiliation | Academia | ¹Texas A&M University, ²The University of Texas Health Science Center at Houston; EMAIL, {degui.zhi}@uth.tmc.edu
Pseudocode | No | The paper describes the proposed methods and model designs (Sections 3 and 4) using mathematical formulations and descriptive text, but it does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | The source code is released as part of the AIRS library (https://github.com/divelab/AIRS/).
Open Datasets | Yes | The CAGE data are sourced from the ENCODE project (Consortium et al., 2012), and we follow the methodology of Lin et al. (2024) to predict gene expression values for 18,377 protein-coding genes. ... We obtained the DNase-seq data from the ENCODE project (Consortium et al., 2012) ... We also obtained H3K27ac data from the ENCODE project (Consortium et al., 2012) ... The Hi-C data were sourced from the 4D Nucleome project (Dekker et al., 2017).
Dataset Splits | Yes | We evaluate model performance using a cross-chromosome validation strategy. The model is trained on all chromosomes except those designated for validation and testing. Specifically, chromosomes 3 and 21 are used as the validation set, and chromosomes 22 and X are reserved for the test set.
Hardware Specification | Yes | All experiments were conducted on a system equipped with an NVIDIA A100 80GB PCIe GPU.
Software Dependencies | No | The paper mentions deep learning models such as Caduceus, Hyena DNA, Mamba, and Enformer, and a peak calling tool MACS3. However, it does not specify version numbers for programming languages or libraries (e.g., Python, PyTorch, TensorFlow) used for the implementation of the proposed method.
Experiment Setup | Yes | Specifically, we train for 50,000 steps on a 4-layer Caduceus architecture from scratch with a hidden dimension of 128, and more hyperparameters can be found in the Appendix A.4. ... The input sequences consist of 200,000 base pairs, centered around the promoter regions of the target genes, providing sufficient contextual information for accurate gene expression prediction. ... Table 3: Hyperparameter values and their search space (final choices are highlighted in bold). Hyperparameters: Values. # Layers of Generator: 4; # Layers of Predictor: 4; Hidden dimensions: 128; α3, β3: [1, 9], [10, 90], [10, 190], [10, 10], [10, 1.11]; # training steps: 50000, 85000; Batch size: 8; Learning rate: 1e-3, 5e-4, 1e-4, 5e-5; Scheduler strategy: Cosine LR with Linear Warmup; Initial warmup learning rate: 1e-5; Min learning rate: 1e-4; Warmup steps: 5,000; Validation model selection criterion: validation MSE.
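
For readers who want to reproduce the setup, the chromosome-level split from the Dataset Splits entry and the "Cosine LR with Linear Warmup" schedule from the Experiment Setup entry can be sketched roughly as follows. This is a minimal illustration, not the authors' code from the AIRS repository. The function names, the `chrN` naming convention, and the exact interpolation formula are assumptions. The peak learning rate of 1e-3 is also an assumption: it is one value from the reported search space, and the bold marking of the final choice did not survive on this page.

```python
import math

# Chromosome assignments quoted in the Dataset Splits entry.
# The "chrN" naming convention is an assumption for illustration.
VALID_CHROMS = {"chr3", "chr21"}
TEST_CHROMS = {"chr22", "chrX"}


def assign_split(chrom: str) -> str:
    """Map a chromosome name to 'valid', 'test', or 'train'."""
    if chrom in VALID_CHROMS:
        return "valid"
    if chrom in TEST_CHROMS:
        return "test"
    return "train"  # every other chromosome is used for training


def lr_at_step(
    step: int,
    peak_lr: float = 1e-3,         # assumed: one value from the search space
    warmup_init_lr: float = 1e-5,  # "Initial warmup learning rate"
    min_lr: float = 1e-4,          # "Min learning rate"
    warmup_steps: int = 5_000,     # "Warmup steps"
    total_steps: int = 50_000,     # "# training steps"
) -> float:
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return warmup_init_lr + frac * (peak_lr - warmup_init_lr)
    # Fraction of the decay phase completed, clamped to 1.0 past the end.
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


if __name__ == "__main__":
    for chrom in ["chr1", "chr3", "chr22", "chrX"]:
        print(chrom, assign_split(chrom))
    for step in [0, 5_000, 27_500, 50_000]:
        print(step, lr_at_step(step))
```

Splitting by whole chromosomes keeps genes that share a genomic neighborhood from straddling the train and test sets, which is the usual motivation for this strategy. In a real training loop, the schedule function could be handed to a framework-level scheduler rather than called by hand.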