Unsupervised Pretraining for Fact Verification by Language Model Distillation

Authors: Adrián Bazaga, Pietro Liò, Gos Micklem

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present a comparative study of the results of our proposed method on standard benchmarks for fact verification, as well as ablation studies on the most relevant components. We first describe the datasets, evaluation and training settings. Next, we discuss extensive experiments on our method for the task of fact verification. Finally, we run a set of ablation studies to evaluate the impact of the most important components of our proposed framework.
Researcher Affiliation | Academia | Adrián Bazaga, Pietro Liò & Gos Micklem, University of Cambridge {ar989,pl219,gm263}@cam.ac.uk
Pseudocode | No | The paper describes its method using text and mathematical equations but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | To guarantee reproducibility of this paper, we release the source code at https://github.com/AdrianBZG/SFAVEL.
Open Datasets | Yes | We use the FEVER (Thorne et al., 2018) dataset for all our experiments and comparison against previous methods. For pre-training we use the official FEVER training set. ... As knowledge base, we use the Wikidata5m (Wang et al., 2021b). ... Finally, we also compare on the FB15k-237 dataset from (Toutanova et al., 2015) in Section A.2 of the Appendix.
Dataset Splits | Yes | For pre-training we use the official FEVER training set. For providing the performance comparisons against previous work, we use the official FEVER testing set. In our ablation studies, we employ the official FEVER development split. To evaluate the learning performance in a low-data regime, we randomly sample 1%, 5% or 10% of the training data. (See the subsampling sketch after this table.)
Hardware Specification | Yes | The batch size is set to 512 over 4 A100 GPUs. (See the multi-GPU sketch after this table.)
Software Dependencies | No | The paper mentions using 'Hugging Face' (Wolf et al., 2020) for initializing pre-trained language models and a 'Relational Graph Attention Network' (Busbridge et al., 2019) for the knowledge model, but it does not specify exact version numbers for these or any other software dependencies. (See the environment sketch after this table.)
Experiment Setup | Yes | Pre-training is run for a total of 1000 epochs. During training we use a RGAT as the knowledge model with 3 convolutional layers, with a hidden size of 512. The projector from node embeddings to triple embeddings is a MLP with the same dimensionality as the pre-trained language model sentence embedding size. The model is trained with the SGD optimizer with momentum 0.9 and weight decay 0.0001. The batch size is set to 512 over 4 A100 GPUs, and the coefficients for the different losses are λintra = λscoring = 1, λdistill = 2. We set the temperature τ = 0.1. We use K = 5 for the number of facts to keep after scoring. The number of negative instances used in the negative pool for contrastive learning is set to M = 4096. ... The classifier is trained for 200 epochs, using the SGD optimizer with 20 as the initial learning rate. (See the configuration sketch after this table.)
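
The "Dataset Splits" row notes that the low-data experiments randomly sample 1%, 5% or 10% of the FEVER training data. The paper does not publish its sampling code, so the following is a minimal sketch of one way to draw such subsets reproducibly; the fixed seed and the shape of the training records are assumptions rather than the authors' implementation.

```python
import random
from typing import Dict, List

def subsample(train_set: List[Dict], fraction: float, seed: int = 0) -> List[Dict]:
    """Return a random `fraction` of the training examples (e.g. 0.01, 0.05, 0.10)."""
    rng = random.Random(seed)                   # fixed seed so the subset is reproducible
    n = max(1, int(len(train_set) * fraction))  # number of claims to keep
    return rng.sample(train_set, n)

# Hypothetical usage: `fever_train` would hold the official FEVER training split,
# e.g. a list of {"claim": ..., "label": ..., "evidence": ...} records.
# low_data_1pct  = subsample(fever_train, 0.01)
# low_data_5pct  = subsample(fever_train, 0.05)
# low_data_10pct = subsample(fever_train, 0.10)
```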
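
The "Hardware Specification" row quotes a batch size of 512 over 4 A100 GPUs. The paper does not say whether 512 is the global or per-device batch size; the sketch below assumes it is global and shows the standard PyTorch DistributedDataParallel pattern for splitting it across devices. All function names here are illustrative.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

GLOBAL_BATCH_SIZE = 512  # as quoted in the paper
NUM_GPUS = 4             # 4 x A100, as quoted

def setup_distributed() -> tuple[int, int]:
    # Expects the environment variables set by `torchrun --nproc_per_node=4 train.py`.
    dist.init_process_group(backend="nccl")
    return dist.get_rank(), dist.get_world_size()

def build_loader(dataset, rank: int, world_size: int) -> DataLoader:
    # Each GPU sees a disjoint shard; 512 / 4 = 128 examples per device per step.
    per_device_batch = GLOBAL_BATCH_SIZE // world_size
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    return DataLoader(dataset, batch_size=per_device_batch, sampler=sampler)

def wrap_model(model: torch.nn.Module, rank: int) -> DDP:
    # Replicate the model on each device and synchronise gradients across the 4 GPUs.
    return DDP(model.to(rank), device_ids=[rank])
```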
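
The "Software Dependencies" row flags that the paper names Hugging Face (Wolf et al., 2020) but pins no versions. Below is a minimal sketch of how a pre-trained language model backbone could be initialised with the transformers library; the checkpoint name, the version note in the comment, and the mean-pooling step are assumptions for illustration only, since the quoted text does not specify them.

```python
# Assumed environment pins for reproducibility (not stated in the paper):
#   torch, transformers, datasets -- record exact versions, e.g. via `pip freeze`.
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint: the quoted excerpt does not name the backbone used.
CHECKPOINT = "bert-base-uncased"  # assumption, for illustration only

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
language_model = AutoModel.from_pretrained(CHECKPOINT)

def embed(sentence: str):
    # Sentence embedding via mean pooling over token states (a common choice;
    # the paper's exact pooling strategy is not given in the quoted text).
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    hidden = language_model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1)                            # (1, dim)
```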
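
The "Experiment Setup" row quotes the pre-training hyperparameters. The sketch below collects them into one configuration object and shows how the three loss terms would be combined with the quoted coefficients (λ_intra = λ_scoring = 1, λ_distill = 2). The pre-training learning rate is not quoted, so it is left as an explicit placeholder; the structure of the code is an assumption, not the released implementation.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class PretrainConfig:
    epochs: int = 1000              # pre-training epochs
    rgat_layers: int = 3            # RGAT convolutional layers (knowledge model)
    hidden_size: int = 512
    batch_size: int = 512           # global batch over 4 A100 GPUs
    momentum: float = 0.9           # SGD momentum
    weight_decay: float = 1e-4
    temperature: float = 0.1        # tau in the contrastive objectives
    top_k_facts: int = 5            # K facts kept after scoring
    num_negatives: int = 4096       # M negatives in the contrastive pool
    lambda_intra: float = 1.0
    lambda_scoring: float = 1.0
    lambda_distill: float = 2.0
    lr: Optional[float] = None      # pre-training learning rate is not quoted

cfg = PretrainConfig()

def total_loss(l_intra: torch.Tensor, l_scoring: torch.Tensor, l_distill: torch.Tensor) -> torch.Tensor:
    # L = lambda_intra * L_intra + lambda_scoring * L_scoring + lambda_distill * L_distill
    return (cfg.lambda_intra * l_intra
            + cfg.lambda_scoring * l_scoring
            + cfg.lambda_distill * l_distill)

# Optimizer as quoted: SGD with momentum 0.9 and weight decay 0.0001.
# optimizer = torch.optim.SGD(model.parameters(), lr=cfg.lr,
#                             momentum=cfg.momentum, weight_decay=cfg.weight_decay)
```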