Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Hyperbolic Fine-Tuning for Large Language Models

Authors: Menglin Yang, Ram B, Aosong Feng, Bo Xiong, Jiahong Liu, Irwin King, Rex Ying

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Extensive experiments across various base models and reasoning benchmarks, specifically arithmetic and commonsense reasoning tasks, demonstrate that HypLoRA substantially improves LLM performance.
Researcher Affiliation	Academia	Menglin Yang1,2, Ram Samarth B B3, Aosong Feng4, Bo Xiong5, Jiahong Liu6, Irwin King6, Rex Ying4. 1 HKUST(GZ); 2 HKUST; 3 Indian Institute of Science; 4 Yale University; 5 Stanford University; 6 The Chinese University of Hong Kong
Pseudocode	No	The paper describes the proposed method, HypLoRA, using mathematical equations and textual explanations in Section 5, but does not include a distinct section or figure labeled 'Pseudocode' or 'Algorithm' with structured steps.
Open Source Code	Yes	Code: https://github.com/marlin-codes/HypLoRA Project: https://hyperboliclearning.github.io/work/hyplora
Open Datasets	Yes	We utilize two high-quality datasets, Math10K and Commonsense170K, tailored for mathematical and commonsense reasoning, respectively. Math10K consists of training data from GSM8K [83], MAWPS, MAWPS-single [85], and 1,000 samples from AQuA [84]... Commonsense170K is constructed by reformatting samples from BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, and OBQA using standardized templates that outline the task, content, and answer, resulting in 170K training samples. The test datasets are drawn from the same sources, with strict separation from training samples.
Dataset Splits	Yes	Math10K consists of training data from GSM8K [83], MAWPS, MAWPS-single [85], and 1,000 samples from AQuA [84]... The test set includes GSM8K, AQuA, MAWPS, and SVAMP [86], ensuring no overlap with the training data. ... Commonsense170K is constructed by reformatting samples from BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC-e, ARC-c, and OBQA using standardized templates that outline the task, content, and answer, resulting in 170K training samples. The test datasets are drawn from the same sources, with strict separation from training samples.
Hardware Specification	Yes	The GPU hours for inference on four datasets are presented in Figure 2. (Figure 2 caption: GPU (A100) usage during inference)
Software Dependencies	No	The paper mentions using the AdamW optimizer and the 'powerlaw' package for analysis, but does not provide specific version numbers for these or other key software components used for implementation.
Experiment Setup	Yes	Across all fine-tuning tasks, we employed the AdamW optimizer with a learning rate of 3 × 10^-4 and trained for a total of three epochs. LoRA modules (and consequently, HypLoRA adapters) were integrated into both the Multi-Head Attention (MHA) and MLP layers of the foundation models. A key hyperparameter for HypLoRA is the curvature K (defining the hyperbolic curvature as 1/K), which was initialized by searching the set {0.5, 1.0}.
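
The reported experiment setup can be summarized as a small configuration sketch. This is not taken from the authors' repository: the class name FineTuneConfig, its field names, the adapter target labels, and the helper curvature_grid are all hypothetical, chosen only to record the values quoted in the Experiment Setup row (AdamW, learning rate 3 × 10^-4, three epochs, adapters in attention and MLP layers, K searched over {0.5, 1.0}).

```python
from dataclasses import dataclass, replace
from typing import Iterator, Tuple


@dataclass(frozen=True)
class FineTuneConfig:
    """Hypothetical container for the hyperparameters quoted in the table."""
    optimizer: str = "AdamW"          # optimizer named in the paper excerpt
    learning_rate: float = 3e-4       # 3 x 10^-4
    epochs: int = 3                   # "a total of three epochs"
    # Illustrative labels only; the real module names depend on the base model.
    adapter_targets: Tuple[str, ...] = ("attention", "mlp")
    curvature_k: float = 1.0          # K, initialized from the searched set


# Candidate initial values for K reported in the excerpt.
K_CANDIDATES = (0.5, 1.0)


def curvature_grid(base: FineTuneConfig = FineTuneConfig()) -> Iterator[FineTuneConfig]:
    """Yield one config per candidate K, leaving the other settings fixed."""
    for k in K_CANDIDATES:
        yield replace(base, curvature_k=k)


if __name__ == "__main__":
    for cfg in curvature_grid():
        print(cfg)
```

Each yielded config would correspond to one fine-tuning run; how the runs are compared to pick K is not stated in the excerpt, so that step is left out here.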