Analysis of Sampling Algorithms for Twitter

Authors: Deepan Subrahmanian Palguna, Vikas Joshi, Venkatesan Chakaravarthy, Ravi Kothari, LV Subramaniam

IJCAI 2015 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We show that experiments conducted on real Twitter data agree with our bounds. In these experiments, we also compare different kinds of random sampling algorithms. Our ideas and results would be of interest to data providers and researchers in order to decide how much data to use in their respective applications. We then show our simulations and experimental results with real Twitter data in Section 6."
Researcher Affiliation | Collaboration | Deepan Palguna, Vikas Joshi, Venkatesan Chakaravarthy, Ravi Kothari and L V Subramaniam; School of ECE, Purdue University, Indiana, USA; IBM India Research Lab, India; dpalguna@purdue.edu, {vijoshij, vechakra, rkothari, lvsubram}@in.ibm.com
Pseudocode | No | The paper describes its algorithms textually but provides no structured pseudocode or algorithm blocks (an illustrative uniform-sampling sketch follows this table).
Open Source Code | No | The paper makes no statement about releasing source code and gives no link to a code repository for its methodology.
Open Datasets | No | "Using Twitter's free API, we created a data set based on certain important political events of 2014, which we call the PE-2014 data set." Although this dataset is used in the experiments, the paper gives no indication that it is publicly available and provides no access information.
Dataset Splits | No | The paper specifies no traditional training, validation, or test splits, since no model is trained. It describes running '100 Monte Carlo rounds' to evaluate its sampling algorithms, but this is a resampling procedure for evaluation, not a fixed data split for a machine learning model.
Hardware Specification | No | The paper gives no specific details about the hardware used to run the experiments, such as CPU or GPU models, memory, or cloud instance specifications.
Software Dependencies | No | The paper mentions tools such as a part-of-speech tagger (citing [Bird et al., 2009], which describes NLTK and Python) and a sentiment analysis algorithm, but it specifies no software dependencies with version numbers.
Experiment Setup | Yes | The theoretical bounds are characterized by parameters such as L (estimated as 33 words), θ (e.g., 0.1, 0.15, 0.25, 0.35), ϵ (e.g., 0.1), and λ (e.g., 0.15, 0.25). The experiments use 100 Monte Carlo rounds and specific sample sizes (e.g., 2000, 4000, 8000, and 10000 Tweets); a sketch of such an evaluation loop appears below.
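
The paper gives only textual descriptions of its sampling algorithms, so the following is a minimal sketch of one standard uniform random sampling method (reservoir sampling, Algorithm R) of the general kind compared in the paper. The function name and stream abstraction are assumptions for illustration; this is not the authors' implementation.

```python
import random

def reservoir_sample(stream, n):
    """Maintain a uniform random sample of n items from a stream of
    unknown length (classic Algorithm R). Illustrative sketch only;
    the paper does not publish its sampling code."""
    sample = []
    for i, item in enumerate(stream):
        if i < n:
            sample.append(item)          # fill the reservoir first
        else:
            j = random.randrange(i + 1)  # uniform slot in [0, i]
            if j < n:
                sample[j] = item         # keep item with probability n/(i+1)
    return sample
```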
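
To make the experiment-setup row concrete, here is a minimal sketch of a 100-round Monte Carlo evaluation at the sample sizes the paper reports. The synthetic corpus, the 0/1 topic encoding, and the acceptance criterion |θ̂ − θ| ≤ ϵθ are assumptions for illustration; the paper runs its experiments on real Twitter data (the PE-2014 set).

```python
import random

# Assumed setup: a corpus in which a fraction theta of tweets mention
# the topic of interest, encoded as 1 (mentions) or 0 (does not).
theta, epsilon, rounds = 0.1, 0.1, 100      # values reported in the paper
corpus = [1] * 10_000 + [0] * 90_000        # hypothetical 100k-tweet corpus

for n in (2000, 4000, 8000, 10000):         # sample sizes from the paper
    hits = 0
    for _ in range(rounds):
        sample = random.sample(corpus, n)   # uniform sample without replacement
        theta_hat = sum(sample) / n         # empirical topic frequency
        if abs(theta_hat - theta) <= epsilon * theta:  # assumed relative-error test
            hits += 1
    print(f"n={n}: estimate within tolerance in {hits}/{rounds} rounds")
```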