Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Predicting the Demographics of Twitter Users from Website Traffic Data

Authors: Aron Culotta, Nirmal Kumar, Jennifer Cutler

AAAI 2015 | Venue PDF | LLM Run Details

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	In this paper, we predict the demographics of Twitter users based on whom they follow. Whereas most prior approaches rely on a supervised learning approach, in which individual users are labeled with demographics, we instead create a distantly labeled dataset by collecting audience measurement data for 1,500 websites (e.g., 50% of visitors to gizmodo.com are estimated to have a bachelor's degree). We then fit a regression model to predict these demographics using information about the followers of each website on Twitter. The resulting average held-out correlation is .77 across six different variables (gender, age, ethnicity, education, income, and child status). We additionally validate the model on a smaller set of Twitter users labeled individually for ethnicity and gender, finding performance that is surprisingly competitive with a fully supervised approach.
Researcher Affiliation	Academia	Aron Culotta and Nirmal Kumar Ravi Department of Computer Science Illinois Institute of Technology Chicago, IL 60616 EMAIL, EMAIL Jennifer Cutler Stuart School of Business Illinois Institute of Technology Chicago, IL 60616 EMAIL
Pseudocode	No	The paper describes the methods used (e.g., elastic net regression, logistic regression) but does not include any pseudocode or algorithm blocks.
Open Source Code	Yes	Code is available here: https://github.com/tapilab/aaai-2015-demographics.
Open Datasets	No	The paper describes how they sampled and collected data from Quantcast.com and Twitter for their experiments. However, it does not provide a public link, a specific citation with author/year for a dataset, or refer to a well-known public dataset name with direct access information for the collected data.
Dataset Splits	Yes	We perform five-fold cross-validation and report the held-out correlation coefficient (r) between the predicted and true demographic variables.
Hardware Specification	No	The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running the experiments.
Software Dependencies	No	We use the implementation of MultiTaskElasticNet in scikit-learn (Pedregosa and others 2011).
Experiment Setup	Yes	After tuning on a validation set for one task, we fix alpha=1e-5 and l1_ratio=0.5.