Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Tight Clusters Make Specialized Experts
Authors: Stefan Nielsen, Rachel Teo, Laziz Abdullaev, Tan Nguyen
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically demonstrate the advantages of our AC router over baseline routing methods when applied on a variety of MoE backbones for language modeling and image recognition tasks in both clean and corrupted settings. |
| Researcher Affiliation | Collaboration | Stefan K. Nielsen, FPT Software AI Center, EMAIL; Rachel S.Y. Teo, Department of Mathematics, National University of Singapore, EMAIL |
| Pseudocode | No | The paper describes methods and equations for the Adaptive Clustering router but does not include a distinct block labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | The code is publicly available at https://github.com/stefvk/ACMoE. |
| Open Datasets | Yes | We evaluate our method on large-scale tasks including Wikitext-103 (Merity et al., 2016) language modeling and ImageNet (Deng et al., 2009) object classification. |
| Dataset Splits | Yes | The WikiText-103 dataset... The validation set and test sets consist of 60 articles with 218K and 246K tokens respectively. EnWik-8 contains 90M characters for training, 5M for validation, and 5M for testing. We use the full ImageNet dataset that contains 1.28M training images and 50K validation images. |
| Hardware Specification | Yes | All models are trained, evaluated, and finetuned on four NVIDIA A100 SXM4 40GB GPUs. |
| Software Dependencies | No | The paper mentions using 'Adam' and 'AdamW' optimizers but does not specify versions for these or any other software libraries or dependencies. |
| Experiment Setup | Yes | All experiments use Adam with a base learning rate of 0.0007. Small configurations use 3000 iterations of learning rate warmup while medium configurations use 4000 iterations. For WikiText-103 pretraining, small Switch backbones are trained for 40 epochs with a batch size of 96 and medium Switch backbones are trained for 80 epochs with a batch size of 48. We use 0.01 auxiliary load balancing loss. |
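
The reported WikiText-103 pretraining settings for the Switch backbones can be collected into a small configuration sketch. This is not taken from the authors' code: the class name `SwitchPretrainConfig`, the field names, and the `warmup_lr` helper are hypothetical. The linear shape of the warmup is an assumption (the summary gives only the warmup length), as is reading "0.01 auxiliary load balancing loss" as a loss weight.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchPretrainConfig:
    """Hypothetical container for the hyperparameters quoted in the table."""
    size: str
    epochs: int
    batch_size: int
    warmup_iters: int
    base_lr: float = 0.0007                # Adam base learning rate (reported)
    aux_load_balance_weight: float = 0.01  # assumed to be a loss weight


# Values as reported for WikiText-103 pretraining of Switch backbones.
CONFIGS = {
    "small": SwitchPretrainConfig("small", epochs=40, batch_size=96, warmup_iters=3000),
    "medium": SwitchPretrainConfig("medium", epochs=80, batch_size=48, warmup_iters=4000),
}


def warmup_lr(cfg: SwitchPretrainConfig, step: int) -> float:
    """Learning rate at a given step.

    Assumption: linear warmup from 0 to base_lr over warmup_iters steps,
    then constant. The source does not state the schedule shape or any decay.
    """
    return cfg.base_lr * min(1.0, step / cfg.warmup_iters)


if __name__ == "__main__":
    for name, cfg in CONFIGS.items():
        halfway = cfg.warmup_iters // 2
        print(name, cfg.epochs, cfg.batch_size, warmup_lr(cfg, halfway))
```

Keeping the two sizes in one mapping makes the differences between them (epochs, batch size, warmup length) explicit, while the shared learning rate and auxiliary loss weight live in the defaults.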