Capturing the Style of Fake News

Authors: Piotr Przybyła (pp. 490-497)

AAAI 2020

Reproducibility assessment. For each variable: the result, followed by the LLM response supporting it.
Research Type: Experimental. LLM Response: "In this study we aim to explore automatic methods that can detect online documents of low credibility, especially fake news, based on the style they are written in. We show that general-purpose text classifiers, despite seemingly good performance when evaluated simplistically, in fact overfit to sources of documents in training data. In order to achieve a truly style-based prediction, we gather a corpus of 103,219 documents from 223 online sources labelled by media experts, devise realistic evaluation scenarios and design two new classifiers: a neural network and a model based on stylometric features. The evaluation shows that the proposed classifiers maintain high accuracy in case of documents on previously unseen topics (e.g. new events) and from previously unseen sources (e.g. emerging news websites)."
Researcher Affiliation: Academia. LLM Response: Piotr Przybyła, Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland (piotr.przybyla@ipipan.waw.pl).
Pseudocode: No. LLM Response: The paper describes the algorithms and models used (stylometric classifier, BiLSTMAvg neural network, bag of words, BERT) but does not include any structured pseudocode or algorithm blocks.
Open Source Code: Yes. LLM Response: "In order to encourage and facilitate further research, we make the corpus, the evaluation scenarios and the code (for the stylometric and neural classifiers) available online." (https://github.com/piotrmp/fakestyle)
Open Datasets: Yes. LLM Response: "In order to encourage and facilitate further research, we make the corpus, the evaluation scenarios and the code (for the stylometric and neural classifiers) available online." (https://github.com/piotrmp/fakestyle)
Dataset Splits: Yes. LLM Response: "The main evaluation procedure involves running the model construction and prediction in a 5-fold cross validation (CV) scenario and comparing its output to true labels."
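The 5-fold cross-validation protocol quoted above can be sketched in plain Python. This is a minimal illustration with placeholder train/predict callables, not the paper's actual models or its released evaluation code; the function name and the interleaved fold assignment are assumptions for the sketch.

```python
from statistics import mean

def five_fold_cv(documents, labels, train_fn, predict_fn, k=5):
    """Run k-fold cross-validation: train on k-1 folds, predict on the
    held-out fold, and compare predictions to the true labels."""
    # Assign every k-th item to the same fold (simple interleaved split).
    folds = [list(range(i, len(documents), k)) for i in range(k)]
    accuracies = []
    for held_out in folds:
        test_set = set(held_out)
        train_idx = [i for i in range(len(documents)) if i not in test_set]
        # Build the model on the training folds only.
        model = train_fn([documents[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        # Predict on the held-out fold and score against true labels.
        preds = [predict_fn(model, documents[i]) for i in held_out]
        accuracies.append(mean(int(p == labels[i])
                               for p, i in zip(preds, held_out)))
    return mean(accuracies)
```

Any classifier exposing a train function and a per-item predict function can be dropped in; the paper's realistic scenarios additionally constrain how folds are formed (by topic or by source), which this sketch does not model.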
Hardware Specification: No. LLM Response: The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running the experiments.
Software Dependencies: No. LLM Response: The paper mentions several software components (Stanford CoreNLP, Mallet, word2vec, the glmnet package in R, TensorFlow, and BERT) but does not specify version numbers for these dependencies, which are necessary for full reproducibility.
Experiment Setup: Yes. LLM Response: "The neural network is implemented and trained in TensorFlow for 10 epochs with sentence length limited to 120 tokens and document length limited to 50 sentences."
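The input-size limits quoted above (120 tokens per sentence, 50 sentences per document) amount to a truncation-and-padding step before the network. A minimal sketch follows, assuming a tokenised document represented as a list of sentences (each a list of token strings) and a hypothetical padding token; this is independent of the paper's actual preprocessing code.

```python
MAX_SENTENCES = 50   # document length limit from the paper
MAX_TOKENS = 120     # sentence length limit from the paper
PAD = "<pad>"        # hypothetical padding token (an assumption)

def clip_and_pad(document):
    """Truncate a tokenised document to at most MAX_SENTENCES sentences of
    MAX_TOKENS tokens each, padding short sentences and short documents."""
    clipped = []
    for sentence in document[:MAX_SENTENCES]:
        tokens = sentence[:MAX_TOKENS]
        # Right-pad each sentence to a fixed width.
        clipped.append(tokens + [PAD] * (MAX_TOKENS - len(tokens)))
    # Pad the document with empty sentences up to the fixed height.
    while len(clipped) < MAX_SENTENCES:
        clipped.append([PAD] * MAX_TOKENS)
    return clipped
```

The result is a fixed 50 x 120 grid of tokens, the shape a sentence-level BiLSTM with averaging would consume after token-to-embedding lookup.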