Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Predicting the Demographics of Twitter Users from Website Traffic Data
Authors: Aron Culotta, Nirmal Kumar Ravi, Jennifer Cutler
AAAI 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we predict the demographics of Twitter users based on whom they follow. Whereas most prior approaches rely on a supervised learning approach, in which individual users are labeled with demographics, we instead create a distantly labeled dataset by collecting audience measurement data for 1,500 websites (e.g., 50% of visitors to gizmodo.com are estimated to have a bachelor's degree). We then fit a regression model to predict these demographics using information about the followers of each website on Twitter. The resulting average held-out correlation is .77 across six different variables (gender, age, ethnicity, education, income, and child status). We additionally validate the model on a smaller set of Twitter users labeled individually for ethnicity and gender, finding performance that is surprisingly competitive with a fully supervised approach. |
| Researcher Affiliation | Academia | Aron Culotta and Nirmal Kumar Ravi, Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616 (EMAIL, EMAIL); Jennifer Cutler, Stuart School of Business, Illinois Institute of Technology, Chicago, IL 60616 (EMAIL) |
| Pseudocode | No | The paper describes the methods used (e.g., elastic net regression, logistic regression) but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available here: https://github.com/tapilab/aaai-2015-demographics. |
| Open Datasets | No | The paper describes how they sampled and collected data from Quantcast.com and Twitter for their experiments. However, it does not provide a public link, a specific citation with author/year for a dataset, or refer to a well-known public dataset name with direct access information for the collected data. |
| Dataset Splits | Yes | We perform five-fold cross-validation and report the held-out correlation coefficient (r) between the predicted and true demographic variables. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | We use the implementation of MultiTaskElasticNet in scikit-learn (Pedregosa and others 2011). |
| Experiment Setup | Yes | After tuning on a validation set for one task, we fix alpha=1e-5 and l1_ratio=0.5. |
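
The Dataset Splits, Software Dependencies, and Experiment Setup rows together describe an evaluation loop that can be sketched roughly as follows. This is a minimal illustration, not the authors' code (see the repository linked in the table for that). The synthetic data, the helper name `heldout_correlations`, the fold shuffling and seed, the `max_iter` value, and the choice to pool predictions across folds before computing r are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet
from sklearn.model_selection import KFold


def heldout_correlations(X, Y, alpha=1e-5, l1_ratio=0.5, n_splits=5, seed=0):
    """Hypothetical helper: held-out Pearson r for each output column of Y."""
    predictions = np.zeros_like(Y, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        # Hyperparameters as quoted in the Experiment Setup row.
        model = MultiTaskElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=5000)
        model.fit(X[train_idx], Y[train_idx])
        # Each row is predicted only by a model that never saw it in training.
        predictions[test_idx] = model.predict(X[test_idx])
    # Assumption: pool the held-out predictions, then compute r per output.
    return np.array(
        [np.corrcoef(Y[:, j], predictions[:, j])[0, 1] for j in range(Y.shape[1])]
    )


# Synthetic stand-in data (NOT the Quantcast/Twitter data from the paper):
# 200 "websites", 30 follower-based features, 3 demographic targets.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
W = rng.normal(size=(30, 3))
Y = X @ W + 0.1 * rng.normal(size=(200, 3))

r = heldout_correlations(X, Y)
print("held-out r per output:", np.round(r, 3))
print("mean held-out r:", round(float(r.mean()), 3))
```

On real data, `X` would hold the follower-derived features for each website and `Y` the audience demographic percentages. The correlations this toy data produces say nothing about the figures reported in the paper.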