Streaming Coresets for Symmetric Tensor Factorization
Authors: Rachit Chhaya, Jayesh Choudhari, Anirban Dasgupta, Supratim Shit
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We give empirical results that compare our sampling scheme with other schemes. (Section 1, Introduction); Table 2. Streaming Single Topic Modeling (Section 6, Applications) |
| Researcher Affiliation | Academia | Computer Science and Engineering, Indian Institute of Technology Gandhinagar, India. |
| Pseudocode | Yes | Algorithm 1 Score(xi, M, Minv, Q); Algorithm 2 Line Filter; Algorithm 3 Kernel Filter (Section 4, Algorithms and Guarantees). A hedged sketch of the kind of online score such a routine might compute appears after the table. |
| Open Source Code | No | The paper does not provide any specific link or explicit statement about making the source code available. |
| Open Datasets | No | Here we use a subset of 20Newsgroups dataset (preprocessed). We took a subset of 10K documents and considered the 100 most frequent words. We normalized each document vector, such that its ℓ1 norm is 1, and created a matrix A ∈ ℝ^{10K×100}. (Section 6, Applications) The paper mentions the 20Newsgroups dataset but does not provide specific access information (link, DOI, or formal citation with authors/year) for the authors' preprocessed subset; a sketch of a comparable preprocessing pipeline appears after the table. |
| Dataset Splits | No | No explicit details on training, validation, or test dataset splits (e.g., percentages, sample counts, or predefined split citations) are provided. The paper describes processing rows 'one at a time' and evaluating the outcomes, but does not use a typical train/validation/test partitioning. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory amounts, or detailed computer specifications) used for running experiments are provided. |
| Software Dependencies | No | No specific software dependencies with version numbers (e.g., library names with versions) are explicitly mentioned. |
| Experiment Setup | Yes | Here we use a subset of 20Newsgroups dataset (preprocessed). We took a subset of 10K documents and considered the 100 most frequent words. We normalized each document vector, such that its ℓ1 norm is 1, and created a matrix A ∈ ℝ^{10K×100}. We feed its rows one at a time to Line Filter + Kernel Filter with p = 3, which returns a coreset C. We run tensor-based single topic modeling (Anandkumar et al., 2014) on A and C, to return 12 top topic distributions from both. We take the best matching between empirical topics and estimated topics based on ℓ1 distance and compute the average ℓ1 difference between them. Here smaller is better. We run this entire method 5 times and report the median of their ℓ1 average differences. (Section 6, Applications) A sketch of this evaluation protocol appears after the table. |
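The authors' exact preprocessed 20Newsgroups subset is not released, but the steps quoted above are concrete enough to sketch a comparable pipeline. The following is a minimal sketch, assuming scikit-learn's `fetch_20newsgroups` and `CountVectorizer` as stand-ins for whatever tokenization the authors used; `build_document_matrix` and its subset-selection logic are illustrative assumptions, not the paper's code.

```python
# Minimal sketch of a comparable 20Newsgroups preprocessing pipeline.
# Subset selection and tokenization here are assumptions, not the authors' code.
import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

def build_document_matrix(n_docs=10_000, n_words=100):
    """Return a document-word count matrix whose rows have unit l1 norm."""
    corpus = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))
    docs = corpus.data[:n_docs]                    # a 10K-document subset
    vec = CountVectorizer(max_features=n_words)    # keep the 100 most frequent words
    A = vec.fit_transform(docs).toarray().astype(float)
    A = A[A.sum(axis=1) > 0]                       # drop docs containing none of those words
    return A / A.sum(axis=1, keepdims=True)        # l1-normalize each row

A = build_document_matrix()  # roughly the A ∈ ℝ^{10K×100} described above
```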
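The report does not reproduce Algorithms 1-3 themselves. From the signature Score(xi, M, Minv, Q) alone, one plausible reading is an online importance score maintained through a running second-moment matrix M and its inverse; the standard technique of that shape is an online leverage score with a Sherman-Morrison rank-one update. The sketch below shows only that generic technique; it is an assumption about the flavor of the routine, not the paper's actual Score, and the Q argument is not modeled.

```python
# Generic online leverage-score computation; NOT the paper's Score routine.
import numpy as np

def online_leverage_score(x, M_inv):
    """Leverage score of row x against M + x x^T, given M^{-1}.
    Uses the Sherman-Morrison identity
    x^T (M + x x^T)^{-1} x = s / (1 + s), where s = x^T M^{-1} x."""
    s = float(x @ M_inv @ x)
    return s / (1.0 + s)

def sherman_morrison_update(M_inv, x):
    """Rank-one update of M^{-1} to (M + x x^T)^{-1}, O(d^2) per row."""
    Mx = M_inv @ x
    return M_inv - np.outer(Mx, Mx) / (1.0 + float(x @ Mx))

# One streaming pass; M is seeded with a small ridge lam * I so that
# its inverse exists before any rows arrive (an implementation choice).
d, lam = 100, 1e-6
M_inv = np.eye(d) / lam
for x in A:                                  # A from the preprocessing sketch
    score = online_leverage_score(x, M_inv)  # e.g. sampling probability ∝ score
    M_inv = sherman_morrison_update(M_inv, x)
```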
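The evaluation protocol in the Experiment Setup row is likewise concrete enough to sketch. Below, `stream_coreset` and `single_topic_model` are hypothetical stand-ins for the paper's Line Filter + Kernel Filter pass and for the tensor-based single topic estimator of Anandkumar et al. (2014); only the matching and aggregation logic is spelled out, using SciPy's min-cost assignment for the best ℓ1 matching.

```python
# Sketch of the evaluation protocol quoted from Section 6.
# `stream_coreset` and `single_topic_model` are hypothetical stand-ins.
import numpy as np
from scipy.optimize import linear_sum_assignment

def avg_l1_topic_error(topics_a, topics_b):
    """Average l1 difference under the best matching of two (k x d) topic sets."""
    # Pairwise l1 distances between every pair of topic distributions.
    D = np.abs(topics_a[:, None, :] - topics_b[None, :, :]).sum(axis=2)
    rows, cols = linear_sum_assignment(D)        # min-cost perfect matching
    return D[rows, cols].mean()

def median_l1_error(A, stream_coreset, single_topic_model, k=12, runs=5):
    """Run the protocol `runs` times; report the median error (smaller is better)."""
    errors = []
    for _ in range(runs):
        C = stream_coreset(A, p=3)               # one streaming pass over A's rows
        topics_full = single_topic_model(A, k)   # empirical topics from full data
        topics_core = single_topic_model(C, k)   # estimated topics from the coreset
        errors.append(avg_l1_topic_error(topics_full, topics_core))
    return float(np.median(errors))
```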