HTLM: Hyper-Text Pre-Training and Prompting of Language Models

Authors: Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, Luke Zettlemoyer

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels. HTLM matches or exceeds the performance of comparably sized text-only LMs for zero-shot prompting and fine-tuning for classification benchmarks, while also setting new state-of-the-art performance levels for zero-shot summarization. We also find that hyper-text prompts provide more value to HTLM, in terms of data efficiency, than plain text prompts do for existing LMs, and that HTLM is highly effective at auto-prompting itself, by simply generating the most likely hyper-text formatting for any available training data.
Researcher Affiliation | Academia | Anonymous authors. Paper under double-blind review.
Pseudocode | No | The paper describes its methods in prose and uses figures to illustrate concepts, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | No | The paper only promises a future release: 'We will release all code and models to support future HTLM research.'
Open Datasets | Yes | Our Hyper Text Language Model (HTLM) is trained on 23TB of simplified HTML which we automatically extract from common crawl dumps (see Section 2.1). We used the January 2021 snapshot of Common Crawl, which provided us with 23 Terabytes of MHTML text after filtering.
Dataset Splits | No | While the paper uses various datasets (e.g., GLUE, CNN/DailyMail), it does not explicitly provide the train/validation/test split percentages or sample counts needed to reproduce the data partitioning. For instance, it mentions a 'maximum of 50 data points from the train set to evaluate the prompts' and task-specific hyperparameters for GLUE in Table 7, but not the overall dataset splits.
Hardware Specification | No | The paper states: 'We trained our augmented BART model for a total of 330,000 steps on 256 GPUs with an effective batch size of 8192.' While it mentions 256 GPUs, it does not specify the GPU model or type, nor any CPU or memory details.
Software Dependencies | No | The paper mentions using the 'Adam optimizer (Kingma & Ba, 2014)' and fastText (Joulin et al., 2016) but does not specify software dependencies such as programming languages, libraries, or frameworks with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9').
Experiment Setup | Yes | We trained our augmented BART model for a total of 330,000 steps on 256 GPUs with an effective batch size of 8192. We initialize our model with the original BART-Large model. We train using the Adam optimizer (Kingma & Ba, 2014) and a polynomial decay learning rate scheduler with a peak learning rate of 4e-5 and 10,000 warm-up steps. We do not use the sentence shuffling from the original BART objective, and select a Poisson λ of 3.5 for sampling span lengths for masking. We set dropout in the attention to 0.1 for the first 170k steps, reducing it to 0.0 thereafter. (Section 2.2). Additionally, Table 7 and Table 8 provide specific hyperparameter values for the GLUE and R3F experiments, respectively. (See the illustrative sketches below the table.)
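
To make the quoted Experiment Setup row concrete, here is a minimal sketch in PyTorch of the optimization schedule described above (330,000 total steps, Adam, polynomial-decay learning rate with a 4e-5 peak and 10,000 warm-up steps, Poisson span-length sampling with λ = 3.5). The decay degree, the stand-in model, and the helper names are assumptions; the paper does not release code for this setup.

```python
# Minimal sketch (assumed reconstruction, not the authors' code) of the
# optimization schedule quoted in the Experiment Setup row.
import numpy as np
import torch

TOTAL_STEPS = 330_000          # total pre-training updates
WARMUP_STEPS = 10_000          # warm-up to the peak learning rate
PEAK_LR = 4e-5                 # peak learning rate
POISSON_LAMBDA = 3.5           # span-length distribution for the denoising mask
DROPOUT_SWITCH_STEP = 170_000  # attention dropout 0.1 before, 0.0 after


def poly_decay(step: int) -> float:
    """Warm-up, then polynomial decay to zero.

    The paper says "polynomial decay" without giving the degree;
    degree 1 (linear) is assumed here.
    """
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    remaining = max(0, TOTAL_STEPS - step)
    return remaining / (TOTAL_STEPS - WARMUP_STEPS)


def sample_span_length(rng: np.random.Generator) -> int:
    """Sample a masking span length for the BART-style denoising objective."""
    return int(rng.poisson(POISSON_LAMBDA))


# Stand-in module; the paper initializes from the original BART-Large weights.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, poly_decay)
```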
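
The "hyper-text prompts" referenced in the Research Type row are prompts written directly as simplified HTML, with the model asked to infill a masked element. The template below is only an illustrative guess at the general shape of such a prompt for zero-shot summarization; the mask token, element choices, and any size hints used in the paper itself may differ.

```python
# Illustrative guess at the shape of a hyper-text summarization prompt;
# not the paper's exact template.
MASK = "<mask>"  # assumed mask placeholder


def summarization_prompt(article: str) -> str:
    """Ask the model to infill a masked <title> element as the summary."""
    return (
        "<html>\n"
        f"  <head><title>{MASK}</title></head>\n"
        f"  <body><p>{article}</p></body>\n"
        "</html>"
    )


print(summarization_prompt("Scientists describe a new method for ..."))
```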