Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Monet: Mixture of Monosemantic Experts for Transformers

Authors: Jungwoo Park, Young Jin Ahn, Kee-Eung Kim, Jaewoo Kang

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our analyses demonstrate mutual exclusivity of knowledge across experts and showcase the parametric knowledge encapsulated within individual experts. Moreover, MONET allows knowledge manipulation over domains, languages, and toxicity mitigation without degrading general performance. Our pursuit of transparent LLMs highlights the potential of scaling expert counts to enhance mechanistic interpretability and directly resect the internal knowledge to fundamentally adjust model behavior. The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Monet. EXPERIMENTS
Researcher Affiliation Collaboration Jungwoo Park1,3, Young Jin Ahn2, Kee-Eung Kim2, Jaewoo Kang1,3; 1Korea University, 2KAIST, 3AIGEN Sciences
Pseudocode Yes Algorithm 1: Simple JAX (Bradbury et al., 2018) and Flax (Heek et al., 2024) implementation of a MONET-HD layer. Algorithm 2: Simple JAX and Flax implementation of a MONET-VD layer.
Open Source Code Yes The source code and pretrained checkpoints are available at https://github.com/dmis-lab/Monet.
Open Datasets Yes All models are trained on 100 billion tokens sampled from the FineWeb-Edu dataset (Penedo et al., 2024), which combines high-quality web content with educational materials. Model configurations are in Table 6. Training is conducted on a TPU-v4-64 Pod Slice, utilizing the AdamW optimizer with a learning rate of 5 × 10⁻⁴ and a batch size of 2 million tokens. We employ Squared ReLU (So et al., 2021; Zhang et al., 2024; Adler et al., 2024) as the activation function. To manage computational resources effectively, we adopt a group routing strategy wherein the routing probabilities are reused every 4 layers. This approach reduces the overhead associated with the expert routing parameters. The weight of the auxiliary loss λ is set to 10⁻³ for all experiments. In addition, we train CODEMONET 1.4B to evaluate the model's capability in coding tasks and analyze multilingual specialization. CODEMONET is pretrained on 100 billion tokens sampled from STARCODERDATA, the primary dataset used to train the StarCoder model (Li et al., 2023).
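The group routing strategy quoted above (routing probabilities computed once, then reused for the next group of layers) can be sketched as follows. This is a hypothetical, simplified illustration in plain Python, not the paper's JAX/Flax implementation; `compute_routing` and `expert_layer` are illustrative stand-ins.

```python
def compute_routing(x, layer_idx):
    # Placeholder router: uniform probabilities over 4 experts.
    # In the actual model this would be a learned routing network.
    return [0.25, 0.25, 0.25, 0.25]

def expert_layer(x, probs):
    # Placeholder expert mixture: identity transform for illustration.
    return x

def forward_with_group_routing(x, num_layers=8, group_size=4):
    """Toy forward pass reusing routing probabilities within each
    group of `group_size` layers, so the router runs only once per group."""
    cached_probs = None
    recomputed_at = []  # layers where routing was actually recomputed
    for layer_idx in range(num_layers):
        if layer_idx % group_size == 0:
            cached_probs = compute_routing(x, layer_idx)
            recomputed_at.append(layer_idx)
        x = expert_layer(x, cached_probs)
    return x, recomputed_at
```

With 8 layers and a group size of 4, the router runs only at layers 0 and 4, which is the overhead reduction the quoted passage describes.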
Dataset Splits No The paper mentions training on 100 billion tokens from specific datasets (FineWeb-Edu, STARCODERDATA) and evaluates on benchmarks using '0-shot and 5-shot settings', but it does not provide explicit training/validation/test splits (e.g., percentages or exact counts) used for the datasets.
Hardware Specification Yes Training is conducted on a TPU-v4-64 Pod Slice, utilizing the AdamW optimizer with a learning rate of 5 × 10⁻⁴ and a batch size of 2 million tokens. ... The instruction tuning process is performed on a single NVIDIA A100 GPU. ... we create VISIONMONET by fine-tuning the MONET 1.4B CHAT model following LLaVA's visual instruction tuning (Liu et al., 2024), using a single NVIDIA A100 GPU.
Software Dependencies No Algorithms 1 and 2 reference JAX (Bradbury et al., 2018) and Flax (Heek et al., 2024), but no specific version numbers are provided for these or for other software components such as the AdamW optimizer, the Squared ReLU activation, or external services (PERSPECTIVE API, ToxiGen RoBERTa model).
Experiment Setup Yes We pretrain our MONET models with parameter sizes of 850 million (850M), 1.4 billion (1.4B), and 4.1 billion (4.1B) ... All models are trained on 100 billion tokens sampled from the FineWeb-Edu dataset... Training is conducted on a TPU-v4-64 Pod Slice, utilizing the AdamW optimizer with a learning rate of 5 × 10⁻⁴ and a batch size of 2 million tokens. We employ Squared ReLU (So et al., 2021; Zhang et al., 2024; Adler et al., 2024) as the activation function. ... we adopt a group routing strategy wherein the routing probabilities are reused every 4 layers. ... The weight of the auxiliary loss λ is set to 10⁻³ for all experiments. ... generated code completions using a temperature of 0.8 and 200 samples per generation. ... Toxicity scores are obtained from the PERSPECTIVE API... We generate outputs with a temperature of 1.0 and a top-p value of 0.9, producing 25 samples of 20 new tokens per prompt. ... We generate outputs with a temperature of 0, producing new sequences of 30 tokens.
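The Squared ReLU activation quoted in the setup above is simply max(0, x)², applied elementwise. A minimal sketch in plain Python (the paper's implementation is in JAX/Flax; this stand-alone version is for illustration only):

```python
def squared_relu(xs):
    """Squared ReLU activation: relu(x) ** 2, applied elementwise.
    Negative inputs map to 0; positive inputs are squared."""
    return [max(0.0, x) ** 2 for x in xs]

# Example: negative values are zeroed, positive values squared.
squared_relu([-1.0, 2.0, 0.5])  # -> [0.0, 4.0, 0.25]
```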