Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
More Experts Than Galaxies: Conditionally-Overlapping Experts with Biologically-Inspired Fixed Routing
Authors: Sagi Shaier, Francisco Pereira, Katharina Kann, Lawrence E Hunter, Matt Jones
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the effectiveness of COMET on a range of tasks, including image classification, language modeling, and regression, using several popular deep learning architectures. ... We validate our approach through experiments on seven diverse tasks, including image classification, language modeling, and regression, demonstrating that our method is applicable to many popular model architectures such as vision transformers, MLP-mixers, GPTs, and standard MLPs, and consistently provides improved performance. |
| Researcher Affiliation | Collaboration | Sagi Shaier, Department of Computer Science, University of Colorado Boulder; Francisco Pereira, Machine Learning Core, National Institute of Mental Health; Katharina von der Wense, Department of Computer Science, University of Colorado Boulder and Institute of Computer Science, Johannes Gutenberg University Mainz; Lawrence E Hunter, Department of Pediatrics, University of Chicago; Matt Jones, Department of Psychology and Neuroscience, University of Colorado Boulder and Google DeepMind |
| Pseudocode | No | The paper describes the architecture and computation steps using mathematical equations (1) through (8) and accompanying text, but does not present a structured pseudocode or algorithm block. |
| Open Source Code | Yes | Code can be found here: https://github.com/Shaier/COMET.git. |
| Open Datasets | Yes | We evaluate these models on the CIFAR10 dataset Krizhevsky (2009), with results shown in fig. 4. ... We evaluate the performance of these models on four widely-used image classification datasets: SVHN Netzer et al. (2011), CIFAR10 Krizhevsky (2009), CIFAR100 Krizhevsky (2009), and Tiny ImageNet Le & Yang (2015). ... We apply COMET to language modeling on Wikitext (Merity et al., 2016) and Code Parrot (Tunstall et al., 2022) ... We apply the COMET method to a regression task using the SARCOS dataset. This dataset is derived from an inverse dynamics problem ... The dataset is publicly available at https://gaussianprocess.org/gpml/data/. |
| Dataset Splits | Yes | We evaluate these models on the CIFAR10 dataset Krizhevsky (2009), with results shown in fig. 4. ... We apply COMET to language modeling on Wikitext (Merity et al., 2016) and Code Parrot (Tunstall et al., 2022) with varying GPT model sizes... We apply the COMET method to a regression task using the SARCOS dataset. This dataset is derived from an inverse dynamics problem involving a 7-joint anthropomorphic robot arm... For well-known benchmark datasets like CIFAR-10, WikiText, Code Parrot, and SARCOS, the train/test/validation splits are standardized or provided with the dataset, implicitly understood in the research community. |
| Hardware Specification | Yes | All models were trained using the Adam optimizer with a cosine learning rate schedule on a single A100 GPU. ... Each model was trained from scratch on a single A100 GPU. |
| Software Dependencies | No | The paper refers to "Hugging Face (2022)" for tuned hyperparameters and mentions using the "AdamW" optimizer, but does not provide specific version numbers for software libraries like PyTorch, Transformers, or Python. |
| Experiment Setup | Yes | We employ a standard 4-layer MLP architecture, utilizing the SGD optimizer with a learning rate of 1e-4. To ensure robustness, we train each model over 3 random seeds for 100 epochs. We systematically explore the effects of varying model capacity and sparsity levels by modifying the number of neurons in each layer and the sparsity ratio. ... Our optimizer of choice was AdamW, with a learning rate of 5e-4, weight decay of 0.1, and 1,000 warmup steps. We also used gradient accumulation with 8 steps, which resulted in an effective batch size of 256, calculated by multiplying the per-device train batch size (32) by the gradient accumulation steps (8). We used tanh activation function on the backbone MLP layers and a cosine learning rate schedule with warmup. We also enabled mixed precision training to accelerate computations. |
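The experiment-setup row above reports an AdamW configuration with 1,000 warmup steps, a cosine learning-rate schedule, and an effective batch size of 256 (per-device batch 32 × 8 gradient-accumulation steps). The following is a minimal Python sketch of that arithmetic and schedule; the exact decay shape (linear warmup, cosine decay to zero) is an assumption, since the paper only states that a cosine schedule with warmup was used.

```python
import math

# Hyperparameters as reported in the paper's language-modeling setup.
PEAK_LR = 5e-4
WARMUP_STEPS = 1_000
PER_DEVICE_BATCH = 32
GRAD_ACCUM_STEPS = 8

# Effective batch size = per-device batch * gradient-accumulation steps.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS  # 256

def cosine_lr_with_warmup(step: int, total_steps: int,
                          peak_lr: float = PEAK_LR,
                          warmup: int = WARMUP_STEPS) -> float:
    """Linear warmup to peak_lr, then cosine decay toward zero.

    The decay floor and cycle length are assumptions; the paper only
    says 'a cosine learning rate schedule with warmup'.
    """
    if step < warmup:
        return peak_lr * step / warmup
    progress = (step - warmup) / max(1, total_steps - warmup)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

In practice this schedule would typically come from a library helper (e.g. a cosine-with-warmup scheduler in the training framework) rather than being hand-rolled; the sketch is only meant to make the reported numbers concrete.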