A Mutual Information Maximization Perspective of Language Representation Learning
Authors: Lingpeng Kong, Cyprien de Masson d'Autume, Lei Yu, Wang Ling, Zihang Dai, Dani Yogatama
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate the effects of training masked language modeling with negative sampling and adding I_DIM to the quality of learned representations. ... Table 2: Summary of results on GLUE. ... Table 3: Summary of results on SQuAD 1.1. |
| Researcher Affiliation | Collaboration | DeepMind, Carnegie Mellon University, Google Brain |
| Pseudocode | No | The paper describes algorithms but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper links to the original BERT model ('https://github.com/google-research/bert'), which is used as a baseline, but provides no access link or explicit statement that its own reimplementation (BERT-NCE) or proposed model (INFOWORD) is open-sourced. |
| Open Datasets | Yes | We use the same training corpora and apply the same preprocessing and tokenization as BERT. ... We evaluate on two benchmarks: GLUE (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2016). |
| Dataset Splits | Yes | For each GLUE task, we use the respective development set to choose the learning rate from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5}, and the batch size from {16, 32}. ... We use the development set to choose the learning rate from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5} and the batch size from {16, 32}. |
| Hardware Specification | No | The paper describes model architectures (e.g., 'BERT_BASE has 12 hidden layers, 768 hidden dimensions, and 12 attention heads') but does not specify the hardware used for training or experiments (e.g., GPU models, CPU models, or cloud computing instances with specifications). |
| Software Dependencies | No | The paper mentions using 'Adam (Kingma & Ba, 2015)' as an optimizer but does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the implementation. |
| Experiment Setup | Yes | Pretraining. We use Adam (Kingma & Ba, 2015) with β_1 = 0.9, β_2 = 0.98 and ε = 1e-6. The batch size for training is 1024 with a maximum sequence length of 512. We train for 400,000 steps (including 18,000 warmup steps) with a weight decay rate of 0.01. We set the learning rate to 4e-4 for all variants of the BASE models and 1e-4 for the LARGE models. We set λ_MLM to 1.0 and tune λ_DIM ∈ {0.4, 0.6, 0.8, 1.0}. |
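
The "Experiment Setup" row above quotes the paper's pretraining hyperparameters: Adam with specific betas and epsilon, a 1024-sequence batch size, 400,000 steps with 18,000 warmup steps, weight decay of 0.01, and scalar weights λ_MLM and λ_DIM on the two objectives. The following is a minimal PyTorch-style sketch of how those reported values could be wired together; the function names, the linear-decay schedule after warmup, and the use of decoupled weight decay (AdamW) are assumptions for illustration, not details confirmed by the paper.

```python
import torch

# Reported pretraining hyperparameters (from the quoted setup).
BATCH_SIZE = 1024
MAX_SEQ_LEN = 512
TOTAL_STEPS = 400_000
WARMUP_STEPS = 18_000
PEAK_LR = 4e-4          # BASE models; 1e-4 for LARGE models
LAMBDA_MLM = 1.0
LAMBDA_DIM = 0.8        # tuned over {0.4, 0.6, 0.8, 1.0}

def lr_at_step(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to 0 (assumed schedule)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Adam with the reported betas/epsilon; weight decay applied in
    # decoupled (AdamW) form, which is an assumption here.
    return torch.optim.AdamW(
        model.parameters(), lr=PEAK_LR,
        betas=(0.9, 0.98), eps=1e-6, weight_decay=0.01,
    )

def total_loss(loss_mlm: torch.Tensor, loss_dim: torch.Tensor) -> torch.Tensor:
    # Weighted sum of the masked-language-modeling and DIM objectives.
    return LAMBDA_MLM * loss_mlm + LAMBDA_DIM * loss_dim
```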
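
The "Dataset Splits" row describes choosing fine-tuning hyperparameters on each task's development set from fixed candidate sets. A hypothetical grid-search sketch of that selection procedure is shown below; `finetune_and_eval` is a placeholder for fine-tuning the pretrained model and returning a dev-set metric, not a function from the paper.

```python
import itertools

# Candidate values quoted in the paper's fine-tuning setup.
LEARNING_RATES = [5e-6, 1e-5, 2e-5, 3e-5, 5e-5]
BATCH_SIZES = [16, 32]

def select_hyperparameters(finetune_and_eval):
    """Return the (lr, batch_size) pair with the best dev-set score.

    `finetune_and_eval(lr, batch_size)` is assumed to fine-tune the
    pretrained model on one task and return a scalar development-set
    metric (higher is better).
    """
    best_score, best_config = float("-inf"), None
    for lr, bs in itertools.product(LEARNING_RATES, BATCH_SIZES):
        score = finetune_and_eval(lr=lr, batch_size=bs)
        if score > best_score:
            best_score, best_config = score, (lr, bs)
    return best_config, best_score
```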