MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization

Authors: Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Valentin Hofmann, Tomasz Limisiewicz, Yulia Tsvetkov, Noah A. Smith

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, we demonstrate that in addition to reducing segmentation disparities, MAGNET also enables faster language modelling and improves downstream utility.
Researcher Affiliation | Academia | ¹University of Washington, ²Allen Institute for AI, ³The Ohio State University, ⁴Charles University
Pseudocode | No | The paper describes the model architecture and processes using natural language and mathematical equations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures.
Open Source Code | Yes | Code and data are publicly available at https://github.com/orevaahia/magnet-tokenization
Open Datasets | Yes | Our pretraining data is obtained from the OSCAR dataset [32].
Dataset Splits | No | The paper mentions using well-known datasets for pretraining (OSCAR) and finetuning (XQuAD, XNLI, PAWS-X, SIB-200), which often have predefined splits. However, it does not explicitly state the specific train/validation/test split percentages or sample counts, nor does it explicitly state that 'standard splits' were used.
Hardware Specification | Yes | We use a learning rate of 5e-5, a warmup ratio of 0.1 and 38,000 training steps, a batch size of 512 distributed across 4 A40 GPUs.
Software Dependencies | No | The paper mentions mathematical functions and optimizers like 'GELU activation function [18]' and 'Adam optimizer [21]', but does not specify software dependencies such as specific library names with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x, Python 3.x).
Experiment Setup | Yes | For all our experiments, we use 14-layer hourglass transformers with 2 layers in the first block, 10 layers in the second block and 2 layers in the final block. For every transformer layer, the hidden dimension is 768 and the intermediate feed-forward dimension is 3072. Each self-attention layer consists of 12 heads. We use a post-norm architecture, the GELU activation function [18] in feedforward layers, and the relative attention parametrisation from Transformer XL. This brings our model's size to 126M parameters. The boundary predictor is a 2-layer MLP that takes a hidden state as input and outputs a scalar prediction at each time step. We use the Adam optimizer [21] with (β1, β2) and ϵ parameters as (0.9, 0.98) and 1e-6, respectively. We use a learning rate of 5e-5, a warmup ratio of 0.1, 38,000 training steps, and a batch size of 512 distributed across 4 A40 GPUs. Each batch consists of examples concatenated up to the maximum sequence length of 512.
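To make the reported experiment setup easier to scan, below is a minimal sketch in PyTorch that collects the stated hyperparameters into a config and mocks up the 2-layer boundary-predictor MLP. The config keys, the `BoundaryPredictor` class name, and the MLP's intermediate width and activation are illustrative assumptions, not the authors' released implementation; refer to https://github.com/orevaahia/magnet-tokenization for the actual code.

```python
import torch
import torch.nn as nn

# Hypothetical summary of the configuration reported in the paper;
# key names are illustrative, not taken from the authors' code.
MAGNET_CONFIG = {
    "block_layers": (2, 10, 2),      # 14-layer hourglass transformer
    "hidden_dim": 768,
    "ffn_dim": 3072,
    "num_heads": 12,
    "norm": "post",                  # post-norm architecture
    "activation": "gelu",            # GELU in feed-forward layers
    "attention": "transformer-xl-relative",
    "max_seq_len": 512,
    "optimizer": "adam",
    "betas": (0.9, 0.98),
    "eps": 1e-6,
    "learning_rate": 5e-5,
    "warmup_ratio": 0.1,
    "train_steps": 38_000,
    "batch_size": 512,               # distributed across 4 A40 GPUs
}


class BoundaryPredictor(nn.Module):
    """2-layer MLP mapping each hidden state to a scalar boundary score.
    The intermediate width and GELU activation are assumptions; the paper
    only states it is a 2-layer MLP with scalar output per time step."""

    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) -> (batch, seq_len)
        return self.mlp(hidden_states).squeeze(-1)


# Usage example: score boundaries for 2 sequences of the maximum length 512.
predictor = BoundaryPredictor(MAGNET_CONFIG["hidden_dim"])
scores = predictor(torch.randn(2, 512, MAGNET_CONFIG["hidden_dim"]))
print(scores.shape)  # torch.Size([2, 512])
```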