Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

HiMoLE: Towards OOD-Robust LoRA via Hierarchical Mixture of Experts

Authors: Yinuo Jiang, Yan Xiaodong, Keyan Ding, Deng Zhao, Lei Liang, Qiang Zhang, Huajun Chen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate HiMoLE on three representative tasks in natural language processing. Experimental results evidence that HiMoLE consistently outperforms existing LoRA-based approaches, significantly reducing performance degradation on OOD data while improving in-distribution performance.
Researcher Affiliation	Collaboration	(1) College of Computer Science and Technology, Zhejiang University; (2) ZJU-Ant Group Joint Lab of Knowledge Graph; (3) ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University; (4) ZJU-UIUC Institute, Zhejiang University; (5) Ant Group. EMAIL
Pseudocode	Yes	Algorithm 1: HiMoLE Two-Stage Training. Input: LLM's frozen weights W0, training data D (composed of (s, y) pairs), pre-trained encoder Encoder(·), cluster count N. Output: Optimized experts and routers.
Open Source Code	Yes	All the datasets are open-source, and the code can be found in the supplementary materials.
Open Datasets	Yes	For the construction of the ID dataset, we rigorously curated English-language resources from the BigBio benchmark [26]... For the selection of OOD datasets, we adopted the criteria outlined in Section 3.1, choosing the rare disease dataset [27]... We adopted the sentiment analysis component in SOCIALITEINSTRUCTIONS [29] dataset as our ID dataset... For the selection of OOD data, we adopted the criteria outlined in Section 3.1 and chose the OPTIMISM [30] dataset... Following previous work [4], we chose SQuAD [31] as the ID dataset... For the selection of OOD data, we chose NewsQA [32]... All datasets are downloaded from Hugging Face using the DATASETS library in Python.
Dataset Splits	Yes	To separate in-distribution data, we use BioBERT [35], BERTweet [36], and DeBERTa-v3-large [37] as encoders to extract sentence-level embeddings for the NER, SA and EQA tasks, respectively. Through sentence feature-based K-means clustering, we obtain subsets of sizes 4, 3 and 3 for each task (see the visualization of the clustering results in Appendix E.1). Table 6 summarizes the datasets used in our experiments, including their task names, respective domains, the number of training and test sets. For Biomedical NER, the ID test data was partitioned into 4 sub-datasets using feature-based K-means clustering.
Hardware Specification	Yes	All experiments are conducted with GPUs having 24GB memory (RTX 4090) for 7B models, GPUs having 40GB memory (RTX A100) for 13B models, and setup with Python 3.8 and Ubuntu 22.04 on x86-64 CPUs.
Software Dependencies	Yes	All experiments are conducted with GPUs having 24GB memory (RTX 4090) for 7B models, GPUs having 40GB memory (RTX A100) for 13B models, and setup with Python 3.8 and Ubuntu 22.04 on x86-64 CPUs.
Experiment Setup	Yes	We set a maximum of 10,000 training steps and perform evaluations on the validation sets of all benchmarks every 50 steps. If there is no improvement on the validation set for 10 consecutive evaluations, we will terminate the training early. The best checkpoint, identified by the highest average accuracy across all benchmarks, is then selected for evaluation on the test set. Table 7: Hyperparameter configurations of LoRA, MixLoRA/HydraLoRA and HiMoLE for finetuning LLaMA2-7B and OneKE-13B. Columns: LoRA | MixLoRA/HydraLoRA | HiMoLE. Cutoff Length: 1024 | 1024 | 1024; Learning Rate: 3e-4 | 3e-4 | 3e-4 (stage 1), 3e-5 (stage 2); Optimizer: AdamW | AdamW | AdamW; Batch size: 16 | 16 | 16; Dropout: 0.05 | 0.05 | 0.05; Where: Up, Down, Gate | Up, Down, Gate | Up, Down, Gate; LoRA Rank: 80 | 8 | 8; LoRA Alpha: 160 | 16 | 16; Top-K (two values listed): 2 | 2.
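
The early-stopping schedule quoted in the Experiment Setup row (at most 10,000 steps, validation every 50 steps, stop after 10 consecutive evaluations without improvement) can be illustrated with a minimal sketch. This is not the authors' code: the function name train_with_early_stopping and the train_step and evaluate callbacks are hypothetical placeholders, and "improvement" is assumed to mean a strictly higher validation score.

```python
from typing import Callable, Optional, Tuple

MAX_STEPS = 10_000   # maximum number of training steps (from the quoted setup)
EVAL_EVERY = 50      # validation interval in steps (from the quoted setup)
PATIENCE = 10        # consecutive non-improving evaluations before stopping


def train_with_early_stopping(
    train_step: Callable[[int], None],   # hypothetical: runs one optimizer step
    evaluate: Callable[[int], float],    # hypothetical: returns a validation score
    max_steps: int = MAX_STEPS,
    eval_every: int = EVAL_EVERY,
    patience: int = PATIENCE,
) -> Tuple[Optional[int], float, int]:
    """Return (best_step, best_score, last_step_run)."""
    best_step: Optional[int] = None
    best_score = float("-inf")
    stale = 0        # evaluations since the last improvement
    last_step = 0

    for step in range(1, max_steps + 1):
        train_step(step)
        last_step = step
        if step % eval_every != 0:
            continue

        score = evaluate(step)
        if score > best_score:
            # Assumption: only a strictly higher score counts as improvement.
            best_step, best_score, stale = step, score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # terminate training early

    return best_step, best_score, last_step
```

In a real run, evaluate would average accuracy across all benchmarks and the checkpoint saved at best_step would be the one evaluated on the test set, as the quoted setup describes.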