DinTucker: Scaling Up Gaussian Process Models on Large Multidimensional Arrays
Authors: Shandian Zhe, Yuan Qi, Youngja Park, Zenglin Xu, Ian Molloy, Suresh Chari
AAAI 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments demonstrate that DINTUCKER maintains the predictive accuracy of InfTucker and is scalable on massive data: on multidimensional arrays with billions of elements from two real-world applications, DINTUCKER achieves significantly higher prediction accuracy with less training time, compared with the state-of-the-art large-scale tensor decomposition method, GigaTensor. |
| Researcher Affiliation | Collaboration | Shandian Zhe (1), Yuan Qi (1), Youngja Park (2), Zenglin Xu (3), Ian Molloy (2), and Suresh Chari (2); (1) Department of Computer Science, Purdue University; (2) IBM Thomas J. Watson Research Center; (3) School of Computer Science & Engineering, Big Data Research Center, University of Electronic Science and Technology of China |
| Pseudocode | No | The paper describes algorithmic steps in prose but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states 'We implemented DINTUCKER with PYTHON and used HADOOP streaming for training and prediction.' but does not explicitly provide a link or statement of open-source code availability for the methodology described. |
| Open Datasets | Yes | The first dataset is NELL, a knowledge base containing triples (e.g., (George Harrison, playsInstrument, Guitar)) from the Read the Web project (Carlson et al. 2010). |
| Dataset Splits | Yes | Specifically, we split the nonzero entries into 5 folds and used 4 folds for training. For the test set, we used all the ones in the remaining fold and randomly chose 0.1% zero entries (so that the evaluations will not be overwhelmed by zero elements). We repeated this procedure 5 times with different training and test sets each time. ... The NELL and ACC datasets contain 0.0001% and 0.003% nonzero entries, respectively. We randomly chose 80% of nonzero entries for training and then, from the remaining entries, we sampled 50 test datasets, each of which consists of 200 nonzero entries and 2,000 zero entries. A hedged sketch of this splitting procedure appears after the table. |
| Hardware Specification | Yes | We carried out our experiments on a HADOOP cluster. The cluster consists of 16 machines, each of which has a quad-core Intel Xeon E3 3.3 GHz CPU, 8 GB RAM, and a 4 TB disk. |
| Software Dependencies | No | The paper states 'We implemented DINTUCKER with PYTHON and used HADOOP streaming for training and prediction.' but does not provide specific version numbers for Python, Hadoop, or any other software dependencies. A generic Hadoop streaming mapper skeleton in Python is sketched after the table. |
| Experiment Setup | Yes | For DINTUCKER, we set the subarray size to 40 × 40 × 40 for Digg1 and Enron, and 20 × 20 × 20 × 20 for Digg2. ... For each strategy, we sampled 1,500 subarrays for training. We ran our distributed online inference algorithm with 3 mappers, and set the number of iterations to 5. We tuned the learning rate η from the range {0.0005, 0.001, 0.002, 0.005, 0.01}. We used another cross-validation to choose the kernel function from the RBF, linear, polynomial, and Matérn functions and tuned its hyperparameters. A hedged sketch of the subarray sampling and kernel choice follows the table. |
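
The splitting procedure quoted in the Dataset Splits row can be summarized in a short sketch. This is not the authors' code: the entry representation (plain lists of index/value records), the function names, and the fixed random seed are assumptions made only for illustration.

```python
import random

def five_fold_split(nonzero_entries, zero_entries, fold, zero_rate=0.001, seed=0):
    """Train on 4 of 5 folds of the nonzero entries; test on the remaining
    fold plus a small random sample (0.1%) of zero entries."""
    rng = random.Random(seed)
    shuffled = list(nonzero_entries)
    rng.shuffle(shuffled)
    folds = [shuffled[i::5] for i in range(5)]
    train = [e for i, f in enumerate(folds) if i != fold for e in f]
    test = folds[fold] + rng.sample(zero_entries, int(len(zero_entries) * zero_rate))
    return train, test

def sample_test_sets(heldout_nonzero, zero_entries, n_sets=50, seed=0):
    """For NELL/ACC: after an 80/20 split of nonzero entries, draw 50 test
    sets, each with 200 held-out nonzero entries and 2,000 zero entries."""
    rng = random.Random(seed)
    return [rng.sample(heldout_nonzero, 200) + rng.sample(zero_entries, 2000)
            for _ in range(n_sets)]
```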
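
The paper names Python and Hadoop streaming but no versions, so only the generic streaming contract can be illustrated: Hadoop streaming pipes input records to a mapper executable on stdin and expects tab-separated key/value pairs on stdout. The record layout and the per-record work below are hypothetical placeholders, not DinTucker's actual update.

```python
#!/usr/bin/env python
# Generic Hadoop streaming mapper skeleton (hypothetical record format):
# each input line is "key<TAB>payload", and the mapper emits "key<TAB>result"
# for a reducer to aggregate.
import sys

def process(payload):
    # Placeholder for per-record work, e.g. a local update on one subarray.
    return payload

for line in sys.stdin:
    line = line.rstrip("\n")
    if not line:
        continue
    key, _, payload = line.partition("\t")
    sys.stdout.write("%s\t%s\n" % (key, process(payload)))
```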
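
The Experiment Setup row fixes the subarray sizes, the number of sampled subarrays (1,500), a learning-rate grid, and a cross-validated kernel choice. The sketch below shows one way to draw random subarray indices and an RBF kernel over the rows of a latent factor matrix; the uniform sampling strategy, the function names, and the array dimensions are assumptions for illustration only.

```python
import numpy as np

def sample_subarray_indices(dims, sub_dims, rng):
    """Draw, for each tensor mode, a random index subset that defines one
    small subarray (e.g., 40 x 40 x 40 out of the full array)."""
    return [rng.choice(d, size=s, replace=False) for d, s in zip(dims, sub_dims)]

def rbf_kernel(U, lengthscale=1.0):
    """RBF (squared-exponential) kernel between rows of a latent factor matrix."""
    sq = np.sum(U ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * U @ U.T
    return np.exp(-d2 / (2.0 * lengthscale ** 2))

rng = np.random.default_rng(0)
dims, sub_dims = (1000, 800, 600), (40, 40, 40)           # illustrative sizes
subarrays = [sample_subarray_indices(dims, sub_dims, rng) for _ in range(1500)]
learning_rate_grid = [0.0005, 0.001, 0.002, 0.005, 0.01]  # grid from the paper
```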