Unsupervised Pretraining for Fact Verification by Language Model Distillation

Authors: Adrián Bazaga, Pietro Liò, Gos Micklem

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we present a comparative study of the results of our proposed method on standard benchmarks for fact verification, as well as ablation studies on the most relevant components. We first describe the datasets, evaluation and training settings. Next, we discuss extensive experiments on our method for the task of fact verification. Finally, we run a set of ablation studies to evaluate the impact of the most important components of our proposed framework.
Researcher Affiliation | Academia | Adrián Bazaga, Pietro Liò & Gos Micklem, University of Cambridge {ar989,pl219,gm263}@cam.ac.uk
Pseudocode | No | The paper describes its method using text and mathematical equations but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | To guarantee reproducibility of this paper, we release the source code at https://github.com/AdrianBZG/SFAVEL.
Open Datasets | Yes | We use the FEVER (Thorne et al., 2018) dataset for all our experiments and comparison against previous methods. For pre-training we use the official FEVER training set. ... As knowledge base, we use the Wikidata5m (Wang et al., 2021b). ... Finally, we also compare on the FB15k-237 dataset from (Toutanova et al., 2015) in Section A.2 of the Appendix.
Dataset Splits | Yes | For pre-training we use the official FEVER training set. For providing the performance comparisons against previous work, we use the official FEVER testing set. In our ablation studies, we employ the official FEVER development split. To evaluate the learning performance in a low-data regime, we randomly sample 1%, 5% or 10% of the training data. (See the subsampling sketch after this table.)
Hardware Specification | Yes | The batch size is set to 512 over 4 A100 GPUs. (See the multi-GPU sketch after this table.)
Software Dependencies | No | The paper mentions using 'Hugging Face' (Wolf et al., 2020) for initializing pre-trained language models and a 'Relational Graph Attention Network' (Busbridge et al., 2019) for the knowledge model, but it does not specify exact version numbers for these or any other software dependencies. (See the environment sketch after this table.)
Experiment Setup | Yes | Pre-training is run for a total of 1000 epochs. During training we use a RGAT as the knowledge model with 3 convolutional layers, with a hidden size of 512. The projector from node embeddings to triple embeddings is a MLP with the same dimensionality as the pre-trained language model sentence embedding size. The model is trained with the SGD optimizer with momentum 0.9 and weight decay 0.0001. The batch size is set to 512 over 4 A100 GPUs, and the coefficients for the different losses are λintra = λscoring = 1, λdistill = 2. We set the temperature τ = 0.1. We use K = 5 for the number of facts to keep after scoring. The number of negative instances used in the negative pool for contrastive learning is set to M = 4096. ... The classifier is trained for 200 epochs, using the SGD optimizer with 20 as the initial learning rate. (See the configuration sketch after this table.)
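
The "Dataset Splits" row notes that the low-data experiments randomly sample 1%, 5% or 10% of the FEVER training data. The paper does not publish its sampling code, so the following is a minimal sketch of one way to draw such subsets reproducibly; the fixed seed and the shape of the training records are assumptions rather than the authors' implementation.

```python
import random
from typing import Dict, List

def subsample(train_set: List[Dict], fraction: float, seed: int = 0) -> List[Dict]:
    """Return a random `fraction` of the training examples (e.g. 0.01, 0.05, 0.10)."""
    rng = random.Random(seed)                   # fixed seed so the subset is reproducible
    n = max(1, int(len(train_set) * fraction))  # number of claims to keep
    return rng.sample(train_set, n)

# Hypothetical usage: `fever_train` would hold the official FEVER training split,
# e.g. a list of {"claim": ..., "label": ..., "evidence": ...} records.
# low_data_1pct  = subsample(fever_train, 0.01)
# low_data_5pct  = subsample(fever_train, 0.05)
# low_data_10pct = subsample(fever_train, 0.10)
```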
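
The "Hardware Specification" row quotes a batch size of 512 over 4 A100 GPUs. The paper does not say whether 512 is the global or per-device batch size; the sketch below assumes it is global and shows the standard PyTorch DistributedDataParallel pattern for splitting it across devices. All function names here are illustrative.

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

GLOBAL_BATCH_SIZE = 512  # as quoted in the paper
NUM_GPUS = 4             # 4 x A100, as quoted

def setup_distributed() -> tuple[int, int]:
    # Expects the environment variables set by `torchrun --nproc_per_node=4 train.py`.
    dist.init_process_group(backend="nccl")
    return dist.get_rank(), dist.get_world_size()

def build_loader(dataset, rank: int, world_size: int) -> DataLoader:
    # Each GPU sees a disjoint shard; 512 / 4 = 128 examples per device per step.
    per_device_batch = GLOBAL_BATCH_SIZE // world_size
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    return DataLoader(dataset, batch_size=per_device_batch, sampler=sampler)

def wrap_model(model: torch.nn.Module, rank: int) -> DDP:
    # Replicate the model on each device and synchronise gradients across the 4 GPUs.
    return DDP(model.to(rank), device_ids=[rank])
```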
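
The "Software Dependencies" row flags that the paper names Hugging Face (Wolf et al., 2020) but pins no versions. Below is a minimal sketch of how a pre-trained language model backbone could be initialised with the transformers library; the checkpoint name, the version note in the comment, and the mean-pooling step are assumptions for illustration only, since the quoted text does not specify them.

```python
# Assumed environment pins for reproducibility (not stated in the paper):
#   torch, transformers, datasets -- record exact versions, e.g. via `pip freeze`.
from transformers import AutoModel, AutoTokenizer

# Placeholder checkpoint: the quoted excerpt does not name the backbone used.
CHECKPOINT = "bert-base-uncased"  # assumption, for illustration only

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
language_model = AutoModel.from_pretrained(CHECKPOINT)

def embed(sentence: str):
    # Sentence embedding via mean pooling over token states (a common choice;
    # the paper's exact pooling strategy is not given in the quoted text).
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True)
    hidden = language_model(**inputs).last_hidden_state  # (1, seq_len, dim)
    return hidden.mean(dim=1)                            # (1, dim)
```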
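
The "Experiment Setup" row quotes the pre-training hyperparameters. The sketch below collects them into one configuration object and shows how the three loss terms would be combined with the quoted coefficients (λ_intra = λ_scoring = 1, λ_distill = 2). The pre-training learning rate is not quoted, so it is left as an explicit placeholder; the structure of the code is an assumption, not the released implementation.

```python
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class PretrainConfig:
    epochs: int = 1000              # pre-training epochs
    rgat_layers: int = 3            # RGAT convolutional layers (knowledge model)
    hidden_size: int = 512
    batch_size: int = 512           # global batch over 4 A100 GPUs
    momentum: float = 0.9           # SGD momentum
    weight_decay: float = 1e-4
    temperature: float = 0.1        # tau in the contrastive objectives
    top_k_facts: int = 5            # K facts kept after scoring
    num_negatives: int = 4096       # M negatives in the contrastive pool
    lambda_intra: float = 1.0
    lambda_scoring: float = 1.0
    lambda_distill: float = 2.0
    lr: Optional[float] = None      # pre-training learning rate is not quoted

cfg = PretrainConfig()

def total_loss(l_intra: torch.Tensor, l_scoring: torch.Tensor, l_distill: torch.Tensor) -> torch.Tensor:
    # L = lambda_intra * L_intra + lambda_scoring * L_scoring + lambda_distill * L_distill
    return (cfg.lambda_intra * l_intra
            + cfg.lambda_scoring * l_scoring
            + cfg.lambda_distill * l_distill)

# Optimizer as quoted: SGD with momentum 0.9 and weight decay 0.0001.
# optimizer = torch.optim.SGD(model.parameters(), lr=cfg.lr,
#                             momentum=cfg.momentum, weight_decay=cfg.weight_decay)
```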