HTLM: Hyper-Text Pre-Training and Prompting of Language Models

Authors: Armen Aghajanyan, Dmytro Okhonko, Mike Lewis, Mandar Joshi, Hu Xu, Gargi Ghosh, Luke Zettlemoyer

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that pretraining with a BART-style denoising loss directly on simplified HTML provides highly effective transfer for a wide range of end tasks and supervision levels. HTLM matches or exceeds the performance of comparably sized text-only LMs for zero-shot prompting and fine-tuning for classification benchmarks, while also setting new state-of-the-art performance levels for zero-shot summarization. We also find that hyper-text prompts provide more value to HTLM, in terms of data efficiency, than plain text prompts do for existing LMs, and that HTLM is highly effective at auto-prompting itself, by simply generating the most likely hyper-text formatting for any available training data.
Researcher Affiliation | Academia | Anonymous authors. Paper under double-blind review.
Pseudocode | No | The paper describes its methods in prose and uses figures to illustrate concepts, but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | No | The paper only promises a future release: 'We will release all code and models to support future HTLM research.'
Open Datasets | Yes | Our Hyper Text Language Model (HTLM) is trained on 23TB of simplified HTML which we automatically extract from common crawl dumps (see Section 2.1). We used the January 2021 snapshot of Common Crawl, which provided us with 23 Terabytes of MHTML text after filtering.
Dataset Splits | No | While the paper uses various datasets (e.g., GLUE, CNN/DailyMail), it does not explicitly provide the train/validation/test split percentages or sample counts needed to reproduce the data partitioning. For instance, it mentions a 'maximum of 50 data points from the train set to evaluate the prompts' and task-specific hyperparameters for GLUE in Table 7, but not the overall dataset splits.
Hardware Specification | No | The paper states: 'We trained our augmented BART model for a total of 330,000 steps on 256 GPUs with an effective batch size of 8192.' While it mentions 256 GPUs, it does not specify the GPU model or type, nor any CPU or memory details.
Software Dependencies | No | The paper mentions using the 'Adam optimizer (Kingma & Ba, 2014)' and fastText (Joulin et al., 2016) but does not specify software dependencies such as programming languages, libraries, or frameworks with version numbers (e.g., 'Python 3.8', 'PyTorch 1.9').
Experiment Setup | Yes | We trained our augmented BART model for a total of 330,000 steps on 256 GPUs with an effective batch size of 8192. We initialize our model with the original BART-Large model. We train using the Adam optimizer (Kingma & Ba, 2014) and a polynomial decay learning rate scheduler with a peak learning rate of 4e-5 and 10,000 warm-up steps. We do not use the sentence shuffling from the original BART objective, and select a Poisson λ of 3.5 for sampling span lengths for masking. We set dropout in the attention to 0.1 for the first 170k steps, reducing it to 0.0 thereafter. (Section 2.2). Additionally, Table 7 and Table 8 provide specific hyperparameter values for the GLUE and R3F experiments, respectively. (See the illustrative sketches below the table.)
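
To make the quoted Experiment Setup row concrete, here is a minimal sketch in PyTorch of the optimization schedule described above (330,000 total steps, Adam, polynomial-decay learning rate with a 4e-5 peak and 10,000 warm-up steps, Poisson span-length sampling with λ = 3.5). The decay degree, the stand-in model, and the helper names are assumptions; the paper does not release code for this setup.

```python
# Minimal sketch (assumed reconstruction, not the authors' code) of the
# optimization schedule quoted in the Experiment Setup row.
import numpy as np
import torch

TOTAL_STEPS = 330_000          # total pre-training updates
WARMUP_STEPS = 10_000          # warm-up to the peak learning rate
PEAK_LR = 4e-5                 # peak learning rate
POISSON_LAMBDA = 3.5           # span-length distribution for the denoising mask
DROPOUT_SWITCH_STEP = 170_000  # attention dropout 0.1 before, 0.0 after


def poly_decay(step: int) -> float:
    """Warm-up, then polynomial decay to zero.

    The paper says "polynomial decay" without giving the degree;
    degree 1 (linear) is assumed here.
    """
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    remaining = max(0, TOTAL_STEPS - step)
    return remaining / (TOTAL_STEPS - WARMUP_STEPS)


def sample_span_length(rng: np.random.Generator) -> int:
    """Sample a masking span length for the BART-style denoising objective."""
    return int(rng.poisson(POISSON_LAMBDA))


# Stand-in module; the paper initializes from the original BART-Large weights.
model = torch.nn.Linear(8, 8)
optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, poly_decay)
```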
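
The "hyper-text prompts" referenced in the Research Type row are prompts written directly as simplified HTML, with the model asked to infill a masked element. The template below is only an illustrative guess at the general shape of such a prompt for zero-shot summarization; the mask token, element choices, and any size hints used in the paper itself may differ.

```python
# Illustrative guess at the shape of a hyper-text summarization prompt;
# not the paper's exact template.
MASK = "<mask>"  # assumed mask placeholder


def summarization_prompt(article: str) -> str:
    """Ask the model to infill a masked <title> element as the summary."""
    return (
        "<html>\n"
        f"  <head><title>{MASK}</title></head>\n"
        f"  <body><p>{article}</p></body>\n"
        "</html>"
    )


print(summarization_prompt("Scientists describe a new method for ..."))
```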