Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts

Authors: Neil He, Rishabh Anand, Hiren Madhu, Ali Maatouk, Smita Krishnaswamy, Leandros Tassiulas, Menglin Yang, Rex Ying

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We are the first to train fully hyperbolic LLMs at billion-parameter scale, and evaluate them on well-known benchmarks such as MMLU and ARC, spanning STEM problem-solving, general knowledge, and commonsense reasoning. Our results show consistent gains from our HELM architectures up to 4% over popular Euclidean architectures used in LLaMA and DeepSeek with superior semantic hierarchy modeling capabilities, highlighting the efficacy and enhanced reasoning afforded by hyperbolic geometry in large-scale language model pretraining.
Researcher Affiliation	Academia	Neil He, Rishabh Anand, Hiren Madhu, Ali Maatouk, Smita Krishnaswamy, Leandros Tassiulas, Menglin Yang, Rex Ying; Yale University, USA
Pseudocode	No	The paper includes mathematical equations and formulas but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Open-source code: github.com/Graph-and-Geometric-Learning/helm
Open Datasets	Yes	We use the English portion of the Wikipedia dataset [14] for training, comprising 6.4M rows of raw text, or roughly 5B tokens.
Dataset Splits	No	For the training dataset, we use the English portion of the Wikipedia dataset [14]. This dataset consists of 6.4M rows of data. We download the dataset directly from Huggingface. The raw text data is then passed through the LLaMA3.1-8B tokenizer [18], which has a vocabulary size of 128K. We use a sequence length of 2048 for all models. Samples longer than 2048 tokens were broken up into multiple samples, with the trailing tail dropped. The tokenized dataset consists of roughly 4.5B 5B tokens. For training efficiency, as we measured that the average number of tokens per sample is 700 across the dataset, we used sample packing with a packing ratio of 3.0. Packed samples shorter than 2048 tokens are then padded on the right. While the paper mentions using the Wikipedia dataset for training and evaluates on benchmarks with standard splits, it does not specify a train/validation/test split for the Wikipedia dataset itself.
Hardware Specification	Yes	Each model was trained on a cluster of 4 NVIDIA A6000 and 4 NVIDIA A800 GPUs with model and data parallelism, where at most 4 GPUs were used by each model.
Software Dependencies	No	For training, we set up data-parallelism with Huggingface Accelerate. We use the LLaMA3.1-8B tokenizer [18] for all models, with a vocabulary size of 128K. The paper mentions software tools like Huggingface Accelerate and the LLaMA3.1-8B tokenizer but does not provide specific version numbers for these or other key software dependencies.
Experiment Setup	Yes	For training, we set up data-parallelism with Huggingface Accelerate. We use an effective batch size of 2M tokens (including padding). To ensure a fair comparison between the hyperbolic and Euclidean models, we use a learning rate of 2e-4 for all dense models and a learning rate of 4e-4 for the MoE and MICE models. A weight decay rate of 0.01 was used for all models. For the HELM-MICE models and the DeepSeek models, in order to balance the load between each expert, we utilize the auxiliary-loss-free load balancing strategy and the complementary sequence-wise auxiliary loss during training. The former punishes extreme load imbalance among the experts by dynamically updating a bias term in the gating module, while not needing an explicit auxiliary loss computation for better training efficiency. The latter punishes extreme load imbalance for any particular sequence. All training used a cosine annealing learning rate scheduler with a final target learning rate of 0.1 times the initial learning rate, with 3% of the gradient update steps used as warmup steps.
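
The learning-rate schedule quoted under Experiment Setup (cosine annealing to 0.1 of the initial rate, with 3% of the update steps used as warmup) can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name lr_at_step, the linear shape of the warmup, and the total number of update steps are assumptions, since the quoted text gives only the warmup length and the final target.

```python
import math


def lr_at_step(step, total_steps, peak_lr, warmup_frac=0.03, final_ratio=0.1):
    """Learning rate at a 0-indexed update step (hypothetical helper).

    warmup_frac=0.03 and final_ratio=0.1 come from the quoted setup;
    the linear warmup shape is an assumption.
    """
    warmup_steps = max(1, int(round(warmup_frac * total_steps)))
    if step < warmup_steps:
        # Assumed linear ramp that reaches peak_lr on the last warmup step.
        return peak_lr * (step + 1) / warmup_steps
    # Cosine annealing from peak_lr down to final_ratio * peak_lr.
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    min_lr = final_ratio * peak_lr
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

For example, with a hypothetical 1,000 update steps and the dense-model rate of 2e-4, warmup covers the first 30 steps and the rate ends at 2e-5; the MoE and MICE models would use peak_lr=4e-4 with the same shape.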