A Mutual Information Maximization Perspective of Language Representation Learning
Authors: Lingpeng Kong, Cyprien de Masson d'Autume, Lei Yu, Wang Ling, Zihang Dai, Dani Yogatama
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate the effects of training masked language modeling with negative sampling and adding I_DIM to the quality of learned representations. ... Table 2: Summary of results on GLUE. ... Table 3: Summary of results on SQuAD 1.1. |
| Researcher Affiliation | Collaboration | DeepMind, Carnegie Mellon University, Google Brain |
| Pseudocode | No | The paper describes algorithms but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper links to the original BERT model ('https://github.com/google-research/bert'), which is used as a baseline, but provides no access link or explicit statement that its own reimplementation (BERT-NCE) or proposed model (INFOWORD) is open-sourced. |
| Open Datasets | Yes | We use the same training corpora and apply the same preprocessing and tokenization as BERT. ... We evaluate on two benchmarks: GLUE (Wang et al., 2019) and SQuAD (Rajpurkar et al., 2016). |
| Dataset Splits | Yes | For each GLUE task, we use the respective development set to choose the learning rate from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5}, and the batch size from {16, 32}. ... We use the development set to choose the learning rate from {5e-6, 1e-5, 2e-5, 3e-5, 5e-5} and the batch size from {16, 32}. |
| Hardware Specification | No | The paper describes model architectures (e.g., 'BERT_BASE has 12 hidden layers, 768 hidden dimensions, and 12 attention heads') but does not specify the hardware used for training or experiments (e.g., GPU models, CPU models, or cloud computing instances with specifications). |
| Software Dependencies | No | The paper mentions using 'Adam (Kingma & Ba, 2015)' as an optimizer but does not provide specific version numbers for any software dependencies, libraries, or frameworks used in the implementation. |
| Experiment Setup | Yes | Pretraining. We use Adam (Kingma & Ba, 2015) with β_1 = 0.9, β_2 = 0.98 and ε = 1e-6. The batch size for training is 1024 with a maximum sequence length of 512. We train for 400,000 steps (including 18,000 warmup steps) with a weight decay rate of 0.01. We set the learning rate to 4e-4 for all variants of the BASE models and 1e-4 for the LARGE models. We set λ_MLM to 1.0 and tune λ_DIM ∈ {0.4, 0.6, 0.8, 1.0}. |
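
The "Experiment Setup" row above quotes the paper's pretraining hyperparameters: Adam with specific betas and epsilon, a 1024-sequence batch size, 400,000 steps with 18,000 warmup steps, weight decay of 0.01, and scalar weights λ_MLM and λ_DIM on the two objectives. The following is a minimal PyTorch-style sketch of how those reported values could be wired together; the function names, the linear-decay schedule after warmup, and the use of decoupled weight decay (AdamW) are assumptions for illustration, not details confirmed by the paper.

```python
import torch

# Reported pretraining hyperparameters (from the quoted setup).
BATCH_SIZE = 1024
MAX_SEQ_LEN = 512
TOTAL_STEPS = 400_000
WARMUP_STEPS = 18_000
PEAK_LR = 4e-4          # BASE models; 1e-4 for LARGE models
LAMBDA_MLM = 1.0
LAMBDA_DIM = 0.8        # tuned over {0.4, 0.6, 0.8, 1.0}

def lr_at_step(step: int) -> float:
    """Linear warmup to PEAK_LR, then linear decay to 0 (assumed schedule)."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * max(0.0, (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS))

def make_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Adam with the reported betas/epsilon; weight decay applied in
    # decoupled (AdamW) form, which is an assumption here.
    return torch.optim.AdamW(
        model.parameters(), lr=PEAK_LR,
        betas=(0.9, 0.98), eps=1e-6, weight_decay=0.01,
    )

def total_loss(loss_mlm: torch.Tensor, loss_dim: torch.Tensor) -> torch.Tensor:
    # Weighted sum of the masked-language-modeling and DIM objectives.
    return LAMBDA_MLM * loss_mlm + LAMBDA_DIM * loss_dim
```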
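
The "Dataset Splits" row describes choosing fine-tuning hyperparameters on each task's development set from fixed candidate sets. A hypothetical grid-search sketch of that selection procedure is shown below; `finetune_and_eval` is a placeholder for fine-tuning the pretrained model and returning a dev-set metric, not a function from the paper.

```python
import itertools

# Candidate values quoted in the paper's fine-tuning setup.
LEARNING_RATES = [5e-6, 1e-5, 2e-5, 3e-5, 5e-5]
BATCH_SIZES = [16, 32]

def select_hyperparameters(finetune_and_eval):
    """Return the (lr, batch_size) pair with the best dev-set score.

    `finetune_and_eval(lr, batch_size)` is assumed to fine-tune the
    pretrained model on one task and return a scalar development-set
    metric (higher is better).
    """
    best_score, best_config = float("-inf"), None
    for lr, bs in itertools.product(LEARNING_RATES, BATCH_SIZES):
        score = finetune_and_eval(lr=lr, batch_size=bs)
        if score > best_score:
            best_score, best_config = score, (lr, bs)
    return best_config, best_score
```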