MAGNET: Improving the Multilingual Fairness of Language Models with Adaptive Gradient-Based Tokenization
Authors: Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Valentin Hofmann, Tomasz Limisiewicz, Yulia Tsvetkov, Noah A. Smith
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, we demonstrate that in addition to reducing segmentation disparities, MAGNET also enables faster language modelling and improves downstream utility. |
| Researcher Affiliation | Academia | University of Washington; Allen Institute for AI; The Ohio State University; Charles University |
| Pseudocode | No | The paper describes the model architecture and processes using natural language and mathematical equations, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | Yes | Code and data are publicly available at https://github.com/orevaahia/magnet-tokenization |
| Open Datasets | Yes | Our pretraining data is obtained from the OSCAR dataset [32]. |
| Dataset Splits | No | The paper mentions well-known datasets for pretraining (OSCAR) and finetuning (XQuAD, XNLI, PAWS-X, SIB-200), which often have predefined splits. However, it does not state the specific train/validation/test split percentages or sample counts, nor does it explicitly say that standard splits were used. |
| Hardware Specification | Yes | We use a learning rate of 5e-5, a warmup ratio of 0.1 and 38,000 training steps, a batch size of 512 distributed across 4 A40 GPUs. |
| Software Dependencies | No | The paper mentions components such as the 'GELU activation function [18]' and the 'Adam optimizer [21]', but it does not specify software dependencies with version numbers (e.g., PyTorch 1.x, TensorFlow 2.x, Python 3.x). |
| Experiment Setup | Yes | For all our experiments, we use 14-layer hourglass transformers with 2 layers in the first block, 10 layers in the second block and 2 layers in the final block. For every transformer layer, the hidden dimension is 768, the intermediate feed-forward dimension is 3072. Each self-attention layer consists of 12 heads. We use a post-norm architecture, GELU activation function [18] in feedforward layers and the relative attention parametrisation from Transformer XL. This brings our model's size to 126M parameters. The boundary predictor is a 2-layer MLP that takes in a hidden state as input and outputs a scalar prediction at each time step. We use the Adam optimizer [21] with (β1, β2) and ϵ parameters as (0.9, 0.98) and 1e-6, respectively. We use a learning rate of 5e-5, a warmup ratio of 0.1 and 38,000 training steps, a batch size of 512 distributed across 4 A40 GPUs. Each batch consists of examples concatenated up to the maximum sequence length of 512. |
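
The "Experiment Setup" and "Hardware Specification" rows above pin down the reported architecture and training hyperparameters. Below is a minimal sketch that collects those values into plain configuration objects; the class and field names are illustrative assumptions for readability and do not reflect the authors' released code at https://github.com/orevaahia/magnet-tokenization.

```python
# Sketch of the reported MAGNET training configuration.
# All values are taken from the quotes in the table above; the
# dataclasses themselves are an assumed, illustrative structure.

from dataclasses import dataclass


@dataclass
class HourglassTransformerConfig:
    # 14 layers total: 2 (first block) + 10 (middle block) + 2 (final block)
    block_layers: tuple = (2, 10, 2)
    hidden_dim: int = 768
    ffn_dim: int = 3072
    num_heads: int = 12
    activation: str = "gelu"            # GELU in feed-forward layers
    norm_style: str = "post"            # post-norm architecture
    attention: str = "transformer-xl"   # relative attention parametrisation
    boundary_predictor_layers: int = 2  # 2-layer MLP, scalar output per time step
    # Reported total model size: ~126M parameters


@dataclass
class TrainingConfig:
    optimizer: str = "adam"
    betas: tuple = (0.9, 0.98)
    eps: float = 1e-6
    learning_rate: float = 5e-5
    warmup_ratio: float = 0.1
    train_steps: int = 38_000
    batch_size: int = 512       # distributed across 4 A40 GPUs
    max_seq_len: int = 512      # examples concatenated up to this length


if __name__ == "__main__":
    model_cfg = HourglassTransformerConfig()
    train_cfg = TrainingConfig()
    print(model_cfg)
    print(train_cfg)
```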