Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts

Authors: Neil He, Rishabh Anand, Hiren Madhu, Ali Maatouk, Smita Krishnaswamy, Leandros Tassiulas, Menglin Yang, Rex Ying

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We are the first to train fully hyperbolic LLMs at billion-parameter scale, and evaluate them on well-known benchmarks such as MMLU and ARC, spanning STEM problem-solving, general knowledge, and commonsense reasoning. Our results show consistent gains from our HELM architectures up to 4% over popular Euclidean architectures used in LLaMA and DeepSeek with superior semantic hierarchy modeling capabilities, highlighting the efficacy and enhanced reasoning afforded by hyperbolic geometry in large-scale language model pretraining.
Researcher Affiliation Academia Neil He, Rishabh Anand, Hiren Madhu, Ali Maatouk, Smita Krishnaswamy, Leandros Tassiulas, Menglin Yang, Rex Ying; Yale University, USA
Pseudocode No The paper includes mathematical equations and formulas but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code Yes Open-source code: github.com/Graph-and-Geometric-Learning/helm
Open Datasets Yes We use the English portion of the Wikipedia dataset [14] for training, comprising 6.4M rows of raw text, or roughly 5B tokens.
Dataset Splits No For the training dataset, we use the English portion of the Wikipedia dataset [14]. This dataset consists of 6.4M rows of data. We download the dataset directly from Hugging Face. The raw text data is then passed through the LLaMA3.1-8B tokenizer [18], which has a vocabulary size of 128K. We use a sequence length of 2048 for all models. Samples longer than 2048 tokens were broken up into multiple samples, with the trailing tail dropped. The tokenized dataset consists of roughly 5B tokens. For training efficiency, since we measured an average of 700 tokens per sample across the dataset, we used sample packing with a packing ratio of 3.0. Packed samples shorter than 2048 tokens are then padded on the right. While the paper mentions using the Wikipedia dataset for training and evaluates on benchmarks with standard splits, it does not specify a train/validation/test split for the Wikipedia dataset itself.
Hardware Specification Yes Each model was trained on a cluster of 4 NVIDIA A6000 and 4 NVIDIA A800 GPUs with model and data parallelism, where at most 4 GPUs were used by each model.
Software Dependencies No For training, we set up data-parallelism with Hugging Face Accelerate. We use the LLaMA3.1-8B tokenizer [18] for all models, with a vocabulary size of 128K. The paper mentions software tools such as Hugging Face Accelerate and the LLaMA3.1-8B tokenizer but does not provide specific version numbers for these or other key software dependencies.
Experiment Setup Yes For training, we set up data-parallelism with Hugging Face Accelerate. We use an effective batch size of 2M tokens (including padding). To ensure a fair comparison between the hyperbolic and Euclidean models, we use a learning rate of 2e-4 for all dense models and a learning rate of 4e-4 for the MoE and MICE models. A weight decay rate of 0.01 was used for all models. For the HELM-MICE models and the DeepSeek models, in order to balance the load between each expert, we utilize the auxiliary-loss-free load balancing strategy and the complementary sequence-wise auxiliary loss during training. The former punishes extreme load imbalance among the experts by dynamically updating a bias term in the gating module, while not needing an explicit auxiliary loss computation for better training efficiency. The latter punishes extreme load imbalance for any particular sequence. All training used a cosine annealing learning rate scheduler with a final target learning rate of 0.1 times the initial learning rate, with 3% of the gradient update steps used as warmup steps.
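
The learning rate schedule quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' code: the function name lr_at_step, its parameters, and the total step count are hypothetical, and the linear shape of the warmup is an assumption, since the quoted text gives only the warmup fraction (3%), the peak learning rates (2e-4 or 4e-4), and the final ratio (0.1).

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_frac=0.03, final_ratio=0.1):
    """Return the learning rate at a 0-indexed optimizer step.

    peak_lr, warmup_frac and final_ratio follow the values quoted above;
    everything else here is an illustrative assumption.
    """
    warmup_steps = max(1, round(total_steps * warmup_frac))
    if step < warmup_steps:
        # Assumption: linear warmup. The source only says 3% of the
        # gradient update steps are warmup steps.
        return peak_lr * (step + 1) / warmup_steps

    # Cosine annealing from peak_lr down to final_ratio * peak_lr.
    min_lr = final_ratio * peak_lr
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total = 1000  # hypothetical step count, not taken from the paper
    for s in (0, 15, 30, 500, 1000):
        print(s, lr_at_step(s, total))
```

Passing peak_lr=4e-4 gives the corresponding schedule for the MoE and MICE models. Under these assumptions the rate rises to the peak over the first 3% of steps and then decays smoothly to one tenth of the peak by the final step.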