Exposing the Implicit Energy Networks behind Masked Language Models via Metropolis–Hastings

Authors: Kartik Goyal, Chris Dyer, Taylor Berg-Kirkpatrick

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the effectiveness of the proposed parametrizations by exploring the quality of samples drawn from these energy-based models for both open-ended unconditional generation and a conditional generation task of machine translation. We empirically investigate the effectiveness of the two proposed energy parametrizations by examining the quality of samples drawn from these energy-based models in two diverse settings: 1) the conditional generation task of machine translation (MT), and 2) open-ended unconditional generation. We observe that high BLEU scores for MT and high fluency scores are correlated with low energy values, which indicates that these parametrizations are reasonable proxies for the desired implicit bidirectional energy network trained via the MLM objective.
Researcher Affiliation | Collaboration | Kartik Goyal (Carnegie Mellon University), Chris Dyer (DeepMind), Taylor Berg-Kirkpatrick (UC San Diego); kartikgo@ttic.edu, cdyer@google.com, tberg@eng.ucsd.edu
Pseudocode | Yes | Algorithm 1: Metropolis-Hastings algorithm for MLMs
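
The row above refers to the paper's Algorithm 1. The following is a minimal Python sketch of one Metropolis-Hastings step for an MLM-defined energy model, not the authors' implementation: energy_fn and masked_conditional are hypothetical placeholders for the paper's energy parametrization and for the BERT masked conditional used as the proposal.

import math
import random

def mh_step(tokens, energy_fn, masked_conditional, temperature=1.0):
    # One Metropolis-Hastings step over a fixed-length token sequence.
    # energy_fn(tokens) -> scalar energy E(X; theta)              (placeholder)
    # masked_conditional(tokens, i) -> dict {token_id: prob}, the MLM
    #     conditional p(x_i | x_-i) with position i masked        (placeholder)
    i = random.randrange(len(tokens))          # choose a position uniformly
    q = masked_conditional(tokens, i)          # proposal distribution q(. | x_-i)

    old_tok = tokens[i]
    cand_toks, cand_probs = zip(*q.items())
    new_tok = random.choices(cand_toks, weights=cand_probs, k=1)[0]

    proposal = list(tokens)
    proposal[i] = new_tok

    # Target is p(X) proportional to exp(-E(X)/T); the masked context x_-i is
    # identical for both states, so both proposal terms use the same q(.).
    log_alpha = (energy_fn(tokens) - energy_fn(proposal)) / temperature \
                + math.log(q[old_tok]) - math.log(q[new_tok])

    if math.log(random.random()) < min(0.0, log_alpha):
        return proposal, True                  # accept the proposed sequence
    return list(tokens), False                 # reject and keep the current state
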
Open Source Code | No | The paper states 'We implemented our sampler and parametrizations on top of this code-base for non-autoregressive MT' and links to a baseline's repository (https://github.com/facebookresearch/Mask-Predict), but does not provide a link or explicit statement for its own open-source code.
Open Datasets | Yes | Data for NMT: We performed experiments by translating the validation and test sets of the WMT-14 German-English (De-En) and WMT-16 Romanian-English (Ro-En) datasets and performed the same tokenization and pre/post-processing as Ghazvininejad et al. (2019).
Dataset Splits | Yes | Data for NMT: We performed experiments by translating the validation and test sets of the WMT-14 German-English (De-En) and WMT-16 Romanian-English (Ro-En) datasets and performed the same tokenization and pre/post-processing as Ghazvininejad et al. (2019).
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory, or cloud computing instances used to run the experiments.
Software Dependencies | No | The paper mentions using 'Hugging Face's PyTorch implementation of uncased BERT-base and BERT-large' but does not specify version numbers for PyTorch or other software dependencies.
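
Since versions are not pinned in the paper, anyone reproducing the setup should record them explicitly. A minimal sketch, assuming the current Hugging Face transformers API and the 'bert-base-uncased' / 'bert-large-uncased' checkpoints named in the paper (the library versions themselves are an assumption):

import torch
import transformers
from transformers import BertTokenizer, BertForMaskedLM

# Log the versions in use, since the paper does not report them.
print("torch", torch.__version__, "| transformers", transformers.__version__)

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").eval()

# Masked conditional p(x_i | x_-i) for a single masked position.
inputs = tokenizer("The movie was absolutely [MASK] .", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                        # [1, seq_len, vocab]

mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero()[0].item()
probs = logits[0, mask_pos].softmax(dim=-1)                # distribution over vocab
print(tokenizer.decode([int(probs.argmax())]))             # most likely filler token
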
Experiment Setup | Yes | For all the sampling baselines, after a burn-in period of 7 epochs, we ran the Markov chain for at least 26 epochs over the dataset. For the reported experimental settings, we ran 500 chains for 100 epochs to produce 500 sequences of diverse lengths varying from 15–45. We experiment by changing the entropy of the masked conditionals via a temperature hyperparameter T. Nucleus sampling involves defining a nucleus boundary b, which prunes out the long tail of the vocabulary. We propose to perform MH sampling from target distributions whose energy values are scaled by low temperatures, i.e., p(X; θ, T) ∝ exp(−E(X; θ)/T). However, such low-entropy target distributions lead to increased rejection rates for the MH samplers. Therefore, we anneal the temperature as a linear function of epochs to gradually decrease the entropy of the target distribution.
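
To make the annealing and nucleus-pruning choices concrete, here is a small sketch under assumed values; the schedule endpoints and the nucleus boundary below are illustrative and are not reported in the excerpt above.

def annealed_temperature(epoch, n_epochs, t_start=1.0, t_end=0.1):
    # Linearly lower the target-distribution temperature T over epochs, so the
    # target p(X; theta, T) ~ exp(-E(X; theta)/T) sharpens gradually instead of
    # rejecting most proposals from the start. Endpoints are illustrative.
    frac = min(epoch / max(n_epochs - 1, 1), 1.0)
    return t_start + frac * (t_end - t_start)

def nucleus_prune(probs, b=0.9):
    # Keep the smallest set of highest-probability tokens whose cumulative mass
    # reaches the nucleus boundary b, then renormalize; this prunes the long
    # tail of the vocabulary from the MLM proposal distribution.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        mass += p
        if mass >= b:
            break
    return {tok: p / mass for tok, p in kept}

With such a schedule, the sampler sketched earlier would be called with temperature=annealed_temperature(epoch, n_epochs) at each epoch, and nucleus_prune would be applied to the masked conditional before proposing a token.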