Scalable and Generalizable Social Bot Detection through Data Selection
Authors: Kai-Cheng Yang, Onur Varol, Pik-Mai Hui, Filippo Menczer
AAAI 2020, pp. 1096-1103
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we propose a framework that uses minimal account metadata, enabling efficient analysis that scales up to handle the full stream of public tweets of Twitter in real time. To ensure model accuracy, we build a rich collection of labeled datasets for training and validation. We deploy a strict validation system so that model performance on unseen datasets is also optimized, in addition to traditional cross-validation. |
| Researcher Affiliation | Academia | (1) Center for Complex Networks and Systems Research, Indiana University, Bloomington, IN, USA; (2) Center for Complex Networks Research, Northeastern University, Boston, MA, USA; (3) Indiana University Network Science Institute, Bloomington, IN, USA |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not explicitly state that the source code for the proposed framework is publicly available, nor does it provide a direct link to such code. |
| Open Datasets | Yes | To train and test our model, we collect all public datasets of labeled human and bot accounts and create three new ones, all available in the bot repository (botometer.org/bot-repository). |
| Dataset Splits | Yes | Random forest classifiers with 100 trees are trained on those 247 combinations, yielding as many candidate models. We record the AUC of each model via five-fold cross-validation. |
| Hardware Specification | Yes | Our classifier was implemented in Python with scikit-learn (Pedregosa et al. 2011) and run on a machine with an Intel Core i7-3770 CPU (3.40GHz) and 8GB RAM. |
| Software Dependencies | No | The paper mentions 'Python with scikit-learn' but does not specify version numbers for Python or scikit-learn, nor does it list other specific software dependencies with versions. |
| Experiment Setup | Yes | Random forest classifiers with 100 trees are trained on those 247 combinations, yielding as many candidate models. [...] Random forest generates a score between 0 and 1 to estimate the likelihood of an account exhibiting bot-like behavior. If we need a binary classifier, we can use a threshold. Fig. 4 illustrates the thresholds that maximize precision and recall (via the F1 metric). |
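
The Experiment Setup row describes turning the random forest's score in [0, 1] into a binary label by picking the threshold that maximizes F1. The following is a minimal sketch of that idea in plain Python, not the authors' code: the function names `f1_at_threshold` and `best_f1_threshold`, the `score >= threshold` convention, and the toy scores and labels are all assumptions made for illustration.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 when accounts with score >= threshold are labeled as bots (label 1).

    The >= convention is an assumption; the source only says a threshold is used.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_f1_threshold(scores, labels):
    """Return the observed score that maximizes F1 when used as the threshold."""
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


# Toy data (hypothetical): classifier scores and true labels (1 = bot, 0 = human).
scores = [0.05, 0.20, 0.35, 0.60, 0.80, 0.95]
labels = [0, 0, 1, 0, 1, 1]

threshold = best_f1_threshold(scores, labels)
best_f1 = f1_at_threshold(scores, labels, threshold)
print(f"threshold={threshold}, F1={best_f1:.3f}")
```

Only observed scores are tried as candidate thresholds, because F1 can change only when the threshold crosses one of them. In practice the scores would come from the trained classifier on held-out accounts rather than from a hand-written list.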