reproducibilityindex.ai

Scalable and Generalizable Social Bot Detection through Data Selection

Authors: Kai-Cheng Yang, Onur Varol, Pik-Mai Hui, Filippo Menczer
Pages: 1096-1103

AAAI 2020

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper we propose a framework that uses minimal account metadata, enabling efficient analysis that scales up to handle the full stream of public tweets of Twitter in real time. To ensure model accuracy, we build a rich collection of labeled datasets for training and validation. We deploy a strict validation system so that model performance on unseen datasets is also optimized, in addition to traditional cross-validation.
Researcher Affiliation	Academia	(1) Center for Complex Networks and Systems Research, Indiana University, Bloomington, IN, USA; (2) Center for Complex Networks Research, Northeastern University, Boston, MA, USA; (3) Indiana University Network Science Institute, Bloomington, IN, USA
Pseudocode	No	The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code	No	The paper does not explicitly state that the source code for the proposed framework is publicly available, nor does it provide a direct link to such code.
Open Datasets	Yes	To train and test our model, we collect all public datasets of labeled human and bot accounts and create three new ones, all available in the bot repository (botometer.org/bot-repository).
Dataset Splits	Yes	Random forest classifiers with 100 trees are trained on those 247 combinations, yielding as many candidate models. We record the AUC of each model via five-fold cross-validation.
Hardware Specification	Yes	Our classifier was implemented in Python with scikit-learn (Pedregosa et al. 2011) and run on a machine with an Intel Core i7-3770 CPU (3.40GHz) and 8GB RAM.
Software Dependencies	No	The paper mentions 'Python with scikit-learn' but does not specify version numbers for Python or scikit-learn, nor does it list other specific software dependencies with versions.
Experiment Setup	Yes	Random forest classifiers with 100 trees are trained on those 247 combinations, yielding as many candidate models. [...] Random forest generates a score between 0 and 1 to estimate the likelihood of an account exhibiting bot-like behavior. If we need a binary classifier, we can use a threshold. Fig. 4 illustrates the thresholds that maximize precision and recall (via the F1 metric).
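
The setup quoted in the Dataset Splits and Experiment Setup rows (a 100-tree random forest, AUC via five-fold cross-validation, and a threshold chosen to maximize F1) can be sketched with scikit-learn roughly as follows. This is a minimal illustration, not the authors' code: the synthetic data from make_classification stands in for the labeled account features, the paper's 247 dataset combinations are not reproduced, and all variable names and the random seed are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import cross_val_predict, cross_val_score

# Synthetic stand-in data (assumption): label 1 = bot, label 0 = human.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Random forest with 100 trees, as stated in the quoted setup.
clf = RandomForestClassifier(n_estimators=100, random_state=0)

# AUC estimated with five-fold cross-validation (one score per fold).
auc_scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
mean_auc = auc_scores.mean()

# Out-of-fold scores between 0 and 1 for the positive (bot) class.
scores = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]

# Pick the score threshold that maximizes F1.
precision, recall, thresholds = precision_recall_curve(y, scores)
# precision and recall have one more entry than thresholds; drop the last point.
p, r = precision[:-1], recall[:-1]
f1 = 2 * p * r / np.maximum(p + r, 1e-12)  # guard against division by zero
best = int(np.argmax(f1))
best_threshold = float(thresholds[best])
best_f1 = float(f1[best])

print(f"mean AUC = {mean_auc:.3f}")
print(f"best threshold = {best_threshold:.3f}, F1 = {best_f1:.3f}")
```

Selecting the threshold on out-of-fold scores, rather than on scores from a model fit to the same rows, is a choice made for this sketch to avoid an optimistic F1 estimate; the quoted text does not say how the thresholds in Fig. 4 were computed.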