Predicting the Demographics of Twitter Users from Website Traffic Data
Authors: Aron Culotta, Nirmal Kumar Ravi, Jennifer Cutler
AAAI 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we predict the demographics of Twitter users based on whom they follow. Whereas most prior approaches rely on a supervised learning approach, in which individual users are labeled with demographics, we instead create a distantly labeled dataset by collecting audience measurement data for 1,500 websites (e.g., 50% of visitors to gizmodo.com are estimated to have a bachelor's degree). We then fit a regression model to predict these demographics using information about the followers of each website on Twitter. The resulting average heldout correlation is .77 across six different variables (gender, age, ethnicity, education, income, and child status). We additionally validate the model on a smaller set of Twitter users labeled individually for ethnicity and gender, finding performance that is surprisingly competitive with a fully supervised approach. |
| Researcher Affiliation | Academia | Aron Culotta and Nirmal Kumar Ravi, Department of Computer Science, Illinois Institute of Technology, Chicago, IL 60616 (aculotta@iit.edu, nravi@hawk.iit.edu); Jennifer Cutler, Stuart School of Business, Illinois Institute of Technology, Chicago, IL 60616 (jcutler2@stuart.iit.edu) |
| Pseudocode | No | The paper describes the methods used (e.g., elastic net regression, logistic regression) but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available here: https://github.com/tapilab/aaai-2015-demographics. |
| Open Datasets | No | The paper describes how they sampled and collected data from Quantcast.com and Twitter for their experiments. However, it does not provide a public link, a specific citation with author/year for a dataset, or refer to a well-known public dataset name with direct access information for the collected data. |
| Dataset Splits | Yes | We perform five-fold cross-validation and report the held-out correlation coefficient (r) between the predicted and true demographic variables. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | We use the implementation of MultiTaskElasticNet in scikit-learn (Pedregosa and others 2011). |
| Experiment Setup | Yes | After tuning on a validation set for one task, we fix alpha=1e-5 and l1_ratio=0.5. |
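
The evaluation described in the table (five-fold cross-validation, scikit-learn's MultiTaskElasticNet, and a held-out correlation per demographic variable) can be illustrated with a minimal sketch. This is not the authors' code. The data is synthetic, the helper name `heldout_correlations` is hypothetical, and pooling predictions across folds before computing r is an assumption, since the quoted text does not say whether r is computed per fold or over all held-out predictions. The `alpha` and `l1_ratio` defaults follow the values quoted in the Experiment Setup row; `max_iter`, the shuffling, and the seed are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNet
from sklearn.model_selection import KFold


def heldout_correlations(X, Y, alpha=1e-5, l1_ratio=0.5, n_splits=5, seed=0):
    """Return one held-out Pearson r per column of Y (hypothetical helper)."""
    predictions = np.zeros_like(Y, dtype=float)
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        # Fit on four folds, predict the remaining fold.
        model = MultiTaskElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=5000)
        model.fit(X[train_idx], Y[train_idx])
        predictions[test_idx] = model.predict(X[test_idx])
    # Assumption: correlate pooled held-out predictions with the true values.
    return np.array(
        [np.corrcoef(Y[:, j], predictions[:, j])[0, 1] for j in range(Y.shape[1])]
    )


# Synthetic stand-in data: rows play the role of websites, columns of X the
# role of follower-based features, columns of Y the role of six demographic
# targets. None of these numbers come from the paper.
rng = np.random.default_rng(0)
n_sites, n_features, n_targets = 200, 30, 6
X = rng.normal(size=(n_sites, n_features))
W = rng.normal(size=(n_features, n_targets))
Y = X @ W + 0.1 * rng.normal(size=(n_sites, n_targets))

r = heldout_correlations(X, Y)
print("per-target r:", r.round(3))
print("average r:", round(float(r.mean()), 3))
```

On real data, `X` would hold the follower-derived features for each website and `Y` the audience demographics; the correlations printed here only reflect the synthetic linear setup and say nothing about the paper's reported results.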