Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization
Authors: Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Valentin Hofmann, Tomasz Limisiewicz, Yulia Tsvetkov, Noah A. Smith
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate that in addition to reducing segmentation disparities, MAGNET also enables faster language modelling and improves downstream utility. |
| Researcher Affiliation | Academia | ¹University of Washington, ²Allen Institute for AI, ³The Ohio State University, ⁴Charles University |
| Pseudocode | No | The paper describes the model architecture and processes using natural language and mathematical equations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | Code and data are publicly available at https://github.com/orevaahia/magnet-tokenization |
| Open Datasets | Yes | Our pretraining data is obtained from the OSCAR dataset [32]. |
| Dataset Splits | No | The paper mentions using well-known datasets for pretraining (OSCAR) and finetuning (XQuAD, XNLI, PAWS-X, SIB-200), which often have predefined splits. However, it does not explicitly state the specific train/validation/test split percentages or sample counts, nor does it explicitly state that 'standard splits' were used. |
| Hardware Specification | Yes | We use a learning rate of 5e-5, a warmup ratio of 0.1 and 38,000 training steps, a batch size of 512 distributed across 4 A40 GPUs. |
| Software Dependencies | No | The paper mentions mathematical functions and optimizers like 'GELU activation function [18]' and 'Adam optimizer [21]', but does not specify software dependencies such as specific library names with their version numbers (e.g., PyTorch 1.x, TensorFlow 2.x, Python 3.x). |
| Experiment Setup | Yes | For all our experiments, we use 14-layer hourglass transformers with 2 layers in the first block, 10 layers in the second block and 2 layers in the final block. For every transformer layer, the hidden dimension is 768, the intermediate feed-forward dimension is 3072. Each self-attention layer consists of 12 heads. We use a post-norm architecture, GELU activation function [18] in feedforward layers and the relative attention parametrisation from Transformer XL. This brings our model's size to 126M parameters. The boundary predictor is a 2-layer MLP that takes in a hidden state as input and outputs a scalar prediction at each time step. We use the Adam optimizer [21] with (β1, β2) and ϵ parameters as (0.9, 0.98) and 1e-6, respectively. We use a learning rate of 5e-5, a warmup ratio of 0.1 and 38,000 training steps, a batch size of 512 distributed across 4 A40 GPUs. Each batch consists of examples concatenated up to the maximum sequence length of 512. |
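
The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object, which makes a few derived quantities easy to check. The following is a minimal sketch, not the authors' code: the class name `MagnetSetup` and all field and property names are hypothetical, and the derived values assume that the batch is split evenly across the 4 GPUs with no gradient accumulation and that the warmup ratio applies to the total number of training steps.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MagnetSetup:
    """Hypothetical container for the values quoted in the Experiment Setup row."""

    layers_per_block: tuple = (2, 10, 2)  # first, second, final hourglass block
    hidden_dim: int = 768
    ffn_dim: int = 3072
    num_heads: int = 12
    adam_betas: tuple = (0.9, 0.98)
    adam_eps: float = 1e-6
    learning_rate: float = 5e-5
    warmup_ratio: float = 0.1
    training_steps: int = 38_000
    batch_size: int = 512
    num_gpus: int = 4
    max_seq_len: int = 512

    @property
    def total_layers(self) -> int:
        # Sum over the three hourglass blocks.
        return sum(self.layers_per_block)

    @property
    def head_dim(self) -> int:
        # Assumes the hidden dimension is split evenly across attention heads.
        return self.hidden_dim // self.num_heads

    @property
    def warmup_steps(self) -> int:
        # Assumes the warmup ratio is a fraction of the total training steps.
        return round(self.warmup_ratio * self.training_steps)

    @property
    def per_gpu_batch(self) -> int:
        # Assumes an even split across GPUs and no gradient accumulation.
        return self.batch_size // self.num_gpus

    @property
    def max_positions_per_batch(self) -> int:
        # Upper bound: every example filled to the maximum sequence length.
        return self.batch_size * self.max_seq_len


if __name__ == "__main__":
    cfg = MagnetSetup()
    print("total layers:", cfg.total_layers)
    print("head dim:", cfg.head_dim)
    print("warmup steps:", cfg.warmup_steps)
    print("per-GPU batch:", cfg.per_gpu_batch)
    print("max positions per batch:", cfg.max_positions_per_batch)
```

Under these assumptions the reported setup corresponds to 14 layers in total, 64-dimensional attention heads, 3,800 warmup steps, and 128 examples per GPU. The paper itself does not state these derived figures, so they should be read as a consistency check rather than reported values.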