Streaming Coresets for Symmetric Tensor Factorization

Authors: Rachit Chhaya, Jayesh Choudhari, Anirban Dasgupta, Supratim Shit

ICML 2020

Reproducibility assessment. Each variable below is listed with its result and the supporting LLM response.
Research Type: Experimental. "We give empirical results that compare our sampling scheme with other schemes" (Section 1, Introduction); see also Table 2, Streaming Single Topic Modeling (Section 6, Applications).
Researcher Affiliation: Academia. "Computer Science and Engineering, Indian Institute of Technology Gandhinagar, India."
Pseudocode: Yes. Algorithm 1, Score(xi, M, Minv, Q); Algorithm 2, Line Filter; Algorithm 3, Kernel Filter (Section 4, Algorithms and Guarantees).
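The pseudocode itself is not reproduced on this page. As a rough illustration of the interface suggested by Score(xi, M, Minv, Q), here is a minimal sketch of an online leverage-score-style streaming filter; the score formula, the sampling rule, and every name below are illustrative assumptions, not the paper's exact Line Filter or Kernel Filter.

```python
import numpy as np

def score(x, M_inv):
    # Hypothetical per-row score: the online leverage-score-style quantity
    # x^T (M + x x^T)^{-1} x, obtained from M^{-1} via Sherman-Morrison.
    t = float(x @ (M_inv @ x))
    return t / (1.0 + t)

def streaming_filter_sketch(rows, d, scale=10.0, ridge=1e-6, seed=0):
    # Keep each incoming row with probability proportional to its score,
    # reweight kept rows by 1/p, and update M^{-1} with a rank-one formula
    # so each row costs O(d^2) -- a natural fit for the streaming setting.
    rng = np.random.default_rng(seed)
    M_inv = np.eye(d) / ridge            # inverse of the ridge * I start
    coreset, weights = [], []
    for x in rows:
        p = min(1.0, scale * score(x, M_inv))
        if rng.random() < p:
            coreset.append(x)
            weights.append(1.0 / p)      # importance weight for the sample
        Mx = M_inv @ x                   # Sherman-Morrison update for x x^T
        M_inv -= np.outer(Mx, Mx) / (1.0 + float(x @ Mx))
    return np.array(coreset), np.array(weights)
```

The Sherman-Morrison update avoids re-inverting M after each row, which is what makes a one-pass filter of this shape feasible; whether the paper's algorithms use exactly this update is not established by the excerpt above.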
Open Source Code: No. The paper provides no link to, or explicit statement about, publicly available source code.
Open Datasets: No. "Here we use a subset of 20Newsgroups dataset (preprocessed). We took a subset of 10K documents and considered the 100 most frequent words. We normalized each document vector, such that its ℓ1 norm is 1, and created a matrix A ∈ R^{10K×100}" (Section 6, Applications). The paper names the 20Newsgroups dataset but provides no access information (link, DOI, or formal citation with authors/year) for its preprocessed subset.
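A minimal sketch of one plausible reconstruction of this preprocessing, assuming scikit-learn's bundled 20Newsgroups loader and CountVectorizer; the paper does not specify its preprocessing pipeline or data source, so the choices below are assumptions.

```python
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

# Load raw 20Newsgroups documents (the exact subset used in the paper
# is unspecified; this takes the first 10K training documents).
docs = fetch_20newsgroups(subset="train",
                          remove=("headers", "footers", "quotes")).data
docs = docs[:10_000]

# Keep only the 100 most frequent words across the subset.
vectorizer = CountVectorizer(max_features=100)
A = vectorizer.fit_transform(docs).toarray().astype(float)

# Normalize each document vector to unit l1 norm (skipping empty rows),
# giving a (up to) 10K x 100 matrix whose nonzero rows each sum to 1.
row_sums = A.sum(axis=1, keepdims=True)
A = np.divide(A, row_sums, out=np.zeros_like(A), where=row_sums > 0)
```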
Dataset Splits: No. No explicit training, validation, or test split details (percentages, sample counts, or predefined split citations) are provided. The paper processes rows "one at a time" and evaluates the outcome, without a conventional train/validation/test partition.
Hardware Specification: No. No hardware details (GPU/CPU models, memory amounts, or other machine specifications) used for the experiments are reported.
Software Dependencies: No. No specific software dependencies with version numbers (e.g., library names with versions) are mentioned.
Experiment Setup: Yes. "Here we use a subset of 20Newsgroups dataset (preprocessed). We took a subset of 10K documents and considered the 100 most frequent words. We normalized each document vector, such that its ℓ1 norm is 1, and created a matrix A ∈ R^{10K×100}. We feed its rows one at a time to Line Filter + Kernel Filter with p = 3, which returns a coreset C. We run tensor-based single topic modeling (Anandkumar et al., 2014) on A and C, to return 12 top topic distributions from both. We take the best matching between empirical topics and estimated topics based on ℓ1 distance and compute the average ℓ1 difference between them. Here smaller is better. We run this entire method 5 times and report the median of their ℓ1 average differences" (Section 6, Applications).
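A minimal sketch of this evaluation protocol under stated assumptions: line_plus_kernel_filter and run_topic_model are hypothetical placeholders for the paper's streaming filters (with p = 3) and the tensor-based topic estimator of Anandkumar et al. (2014), and the "best matching" is computed here with a Hungarian assignment over pairwise ℓ1 distances, one natural reading of the quoted description.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def avg_l1_topic_error(topics_full, topics_coreset):
    # Best matching between the two topic sets under l1 ("cityblock")
    # distance, then the average l1 difference over matched pairs.
    # Smaller is better.
    cost = cdist(topics_full, topics_coreset, metric="cityblock")
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

def evaluate(A, line_plus_kernel_filter, run_topic_model, k=12, reps=5):
    # Hypothetical driver: repeat the whole pipeline 5 times and report
    # the median of the average l1 differences, as in the paper's setup.
    errors = []
    for seed in range(reps):
        C, w = line_plus_kernel_filter(A, p=3, seed=seed)  # stream rows of A
        topics_A = run_topic_model(A, k=k)                 # 12 topics from A
        topics_C = run_topic_model(C, k=k, weights=w)      # 12 topics from C
        errors.append(avg_l1_topic_error(topics_A, topics_C))
    return np.median(errors)
```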