Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Causality-Induced Positional Encoding for Transformer-Based Representation Learning of Non-Sequential Features
Authors: Kaichen Xu, Yihang Du, Mianpeng Liu, Zimu Yu, Xiaobo Sun
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate CAPE over both synthetic and real-world datasets, empirically demonstrating its theoretical properties and effectiveness in enhancing transformer for data with non-sequential features. |
| Researcher Affiliation | Academia | Kaichen Xu¹, Yihang Du², Mianpeng Liu², Zimu Yu², Xiaobo Sun³. ¹ Department of Computer Science, Emory University; ² School of Statistics and Mathematics, Zhongnan University of Economics and Law; ³ School of Medicine, Department of Human Genetics, Emory University |
| Pseudocode | No | The paper provides detailed mathematical formulations and step-by-step descriptions of the methodology, but it does not include any explicitly labeled pseudocode blocks or algorithms in a structured, code-like format. |
| Open Source Code | Yes | Our code is available at https://github.com/Catchxu/CAPE. |
| Open Datasets | Yes | We collect a wide variety of single-cell multi-omics datasets from Homo sapiens and Mus musculus, which are sourced from the CELLxGENE database [76] at https://cellxgene.cziscience.com/. This collection includes 1,465 datasets, encompassing around 91.5 million cells and covering approximately 900 different cell types, with data spanning several sequencing methods and omics modalities. |
| Dataset Splits | Yes | The former does not need to be split, while the latter needs to be split into a fine-tuning dataset and a test dataset in a 3:7 ratio according to different uses. |
| Hardware Specification | No | The paper mentions training procedures and optimizers but does not specify any particular hardware components like GPU models, CPU types, or memory amounts used for the experiments. |
| Software Dependencies | No | The paper mentions software like Scanpy [50], MaxQuant, UniProt human proteome database [94], SCoPE2 [45], Seurat packages [95], and AdamW optimizer, but it does not provide specific version numbers for these tools or libraries. |
| Experiment Setup | Yes | Causal Structure Learning (Step I): Given a preprocessed matrix $X \in \mathbb{R}^{N \times M}$, we parameterize the causal graph as a learnable matrix $A \in \mathbb{R}^{M \times M}$. Both encoder and decoder are 1-64-1 MLPs. We train $A$ via Eq. (9) with regularization coefficient $\lambda_s = 1$, where we use AdamW with a batch size of 128, a learning rate of 3e-3, and 100 epochs for optimization. After training, we apply a pruning threshold of $\tau = 0.2$ to obtain the final adjacency matrix. Mapping Causal Structure to Hyperbolic Space (Step II): Given the trained $A \in \mathbb{R}^{M \times M}$ from Step I, we map each variable into a $d$-dimensional hyperbolic space, where $d = D/2$, and the dimensionality of variable embeddings $D$ is determined by the selected transformer backbones (e.g., $D = 200$ for scBERT and $D = 512$ for scGPT). Then, $k_{\text{hop}}$ in the graph contrastive learning Eq. (16) is set as 2, while the regularization weight $\lambda_g$ is set as 0.1, and the relative weight for the restart matrix $w$ is set as 0.15. Finally, we also choose the AdamW optimizer with a batch size of 32, a learning rate of 1e-3, and 1000 epochs for optimization. Transforming Hyperbolic Positional Encoding to Rotary Form (Step III): For scBERT, we set the dimension of feature embeddings to 200 and the backbone network adopts the performer architecture. The pre-training process is consistent with the values in the original scBERT study, that is, epochs is set to 100, batch size is 3, learning rate is 1e-4, and Adam is used for optimization. For scGPT, we set the dimension of feature embeddings to 512. The backbone network has 4 transformer blocks, each with 8 attention heads. The pre-training process is consistent with the values in the original scBERT study, that is, epochs is set to 60, batch size is 5, learning rate is 1e-4, and Adam is used for optimization. |
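
For readers who want the reported settings in machine-readable form, the sketch below collects the optimizer values quoted in the Experiment Setup row and illustrates the pruning threshold and the 3:7 fine-tuning/test split mentioned in the table. It is not taken from the CAPE repository: the names `OptimConfig`, `REPORTED_SETUP`, `prune_adjacency`, and `split_finetune_test` are hypothetical, and two details are assumptions the quoted text does not confirm, namely that pruning compares the absolute value of each entry against τ and that the split is a seeded random shuffle.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class OptimConfig:
    """Optimizer settings for one training stage (hypothetical container)."""
    optimizer: str
    batch_size: int
    learning_rate: float
    epochs: int


# Values copied from the Experiment Setup row; the key names are illustrative.
REPORTED_SETUP = {
    "step1_causal_structure": OptimConfig("AdamW", 128, 3e-3, 100),
    "step2_hyperbolic_mapping": OptimConfig("AdamW", 32, 1e-3, 1000),
    "scbert_pretraining": OptimConfig("Adam", 3, 1e-4, 100),
    "scgpt_pretraining": OptimConfig("Adam", 5, 1e-4, 60),
}

PRUNE_THRESHOLD = 0.2  # tau from Step I


def prune_adjacency(A, tau=PRUNE_THRESHOLD):
    """Zero out weak edges of a learned adjacency matrix (list of rows).

    Assumption: an entry is kept when its absolute value is at least tau.
    The table only states that a threshold of 0.2 is applied.
    """
    return [[a if abs(a) >= tau else 0.0 for a in row] for row in A]


def split_finetune_test(items, finetune_frac=0.3, seed=0):
    """Split items into fine-tuning and test parts in a 3:7 ratio by default.

    Assumption: a seeded random shuffle; the source gives only the ratio.
    """
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = round(len(items) * finetune_frac)
    return items[:cut], items[cut:]
```

Calling `split_finetune_test(range(10))` yields 3 fine-tuning items and 7 test items. Since the paper reports no hardware or library versions, these values alone may not be enough to reproduce the published numbers.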