Large Scale Homophily Analysis in Twitter Using a Twixonomy
Authors: Stefano Faralli, Giovanni Stilo, Paola Velardi
IJCAI 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we perform a large-scale homophily analysis on Twitter using a hierarchical representation of users interests which we call a Twixonomy. Then, to analyze homophily, we compare different methods to detect communities in a peer friends Twitter network, and then for each community we compute the degree of homophily on the basis of a measure of pairwise semantic similarity. Table 1 shows some network statistics. Figure 1 shows the coverage of the Twitter 2009 and NY-Twitter 2014 populations as a function of the number of expressed interests. Figure 4 is obtained by considering the whole set of clusters jointly extracted by the three clustering methods (Infomap, K-core and Ego), and then computing avg2(Sem(A, B)) as a function of the maximum considered generality level Lk in the Twixonomy, for each of the three strategies: Clique, Connected, and Random. |
| Researcher Affiliation | Academia | Stefano Faralli, Giovanni Stilo and Paola Velardi, Sapienza University of Rome, Dipartimento di Informatica, {faralli,stilo,velardi}@di.uniroma1.it |
| Pseudocode | Yes | Algorithm 1: Build Twixonomy. Input: F = Twitter users followed by at least one member of the initial Twitter population P; CG: top category hierarchy from Wikipedia. Output: a DAG taxonomy where: i) leaf nodes are wikipages associated to Twitter topical users, and the remaining nodes are Wikipedia categories; ii) edges are one of three kinds: <supercategory, category>, <category, wikipage>, <wikipage, Twitter topical user>. Algorithm 2: Remove Cycles. Input: a directed graph G. Output: a directed acyclic graph (DAG). |
| Open Source Code | No | The paper states that "Both BabelNet and Babelfy are available on-line" (footnote 2: http://babelnet.org/), referring to third-party tools used in the research. It does not provide any statement or link for open-source code of the authors' own methodology (e.g., the Twixonomy construction or homophily analysis). |
| Open Datasets | Yes | i) The Twitter 2009 network: The authors in Kwak et al. [2010] crawled and released the entire Twitter network as of July 2009. iv) The Wikipedia Graph: We created the Wikipedia graph from the Wikipedia dump in 2009 and 2014 (for consistency with the two Twitter population datasets). |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits in the conventional machine learning sense. It describes two distinct datasets (Twitter 2009 and NY-Twitter 2014) used for analysis and filters clusters based on size ("In our analysis, we consider only clusters in which the number of members is between 50 and 1000"), but it does not specify data partitioning into train, validation, or test sets. |
| Hardware Specification | No | The paper mentions: "In practice, on the very large Wikipedia graph obtained when starting from the Twitter 2009 population, the algorithm was able to remove all cycles in 12 hours, while all the previously cited cycle detection algorithms either saturated the memory or could not return a solution after six days when using a mid-high level desktop computer." The description "mid-high level desktop computer" is too vague and does not provide specific hardware details (e.g., CPU/GPU model, RAM). |
| Software Dependencies | No | The paper mentions software tools like "Babelfy [Moro et al., 2014]" and "BabelNet [Navigli and Ponzetto, 2012]" as resources used. However, it does not provide specific version numbers for these or any other software dependencies, which are necessary for reproducibility. |
| Experiment Setup | Yes | To measure pairwise semantic similarity we define the following formula... Finally $w(d) = \beta e^{-\alpha(d+1)}$ is a weight function with exponential decay, where we empirically set β = 2 and α = 0.5. To identify communities, we first extract a mutual follow mutual mention network MFM from each of the two Twitter populations P. Then, we detect communities in MFM using three alternative community detection methods: 1) Infomap [Rosvall and Bergstrom, 2008]; 2) A variant of K-core decomposition [Seidman, 1983]; 3) Ego networks. In our analysis, we consider only clusters in which the number of members is between 50 and 1000. |
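
The two concrete settings quoted in the Experiment Setup row (the decay weight with β = 2 and α = 0.5, and the cluster-size filter of 50 to 1000 members) can be illustrated with a minimal Python sketch. It rests on several assumptions that the row itself does not settle: the exponent is taken to be negative because the row calls the function an exponential decay, the size bounds are treated as inclusive, and the names `weight` and `filter_clusters` are hypothetical. The meaning of the argument `d` is not defined in this summary, and the sketch does not attempt the full similarity formula, which the row elides.

```python
import math

BETA = 2.0   # beta, as quoted in the Experiment Setup row
ALPHA = 0.5  # alpha, as quoted in the Experiment Setup row


def weight(d: int, beta: float = BETA, alpha: float = ALPHA) -> float:
    """Exponentially decaying weight w(d) = beta * exp(-alpha * (d + 1)).

    Assumption: the exponent is negative, since the row describes
    the function as an exponential decay.
    """
    return beta * math.exp(-alpha * (d + 1))


def filter_clusters(clusters, min_size: int = 50, max_size: int = 1000):
    """Keep clusters whose member count lies in [min_size, max_size].

    Assumption: both bounds are inclusive; the row only says
    "between 50 and 1000".
    """
    return [c for c in clusters if min_size <= len(c) <= max_size]


if __name__ == "__main__":
    # The weight shrinks as d grows.
    for d in range(4):
        print(d, round(weight(d), 4))

    # Toy clusters of sizes 10, 50, 500 and 2000; only the middle two survive.
    toy = [set(range(n)) for n in (10, 50, 500, 2000)]
    print([len(c) for c in filter_clusters(toy)])
```

Under these assumptions the weight falls by a constant factor of e^(-0.5) for each unit increase in `d`, so larger values of `d` contribute progressively less.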