A Distributed and Scalable Machine Learning Approach for Big Data
Authors: Hongliang Guo, Jie Zhang
IJCAI 2016
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments performed on Hadoop confirm that our proposed approach is superior in terms of both testing errors and convergence rate (computation time) over the canonical distributed machine learning techniques that deal with big data. Experimental results on large data sets verify that our approach achieves better convergence rate than four other canonical distributed or big data machine learning algorithms, on both regression and classification tasks. |
| Researcher Affiliation | Academia | Hongliang Guo and Jie Zhang, School of Computer Science and Engineering, Nanyang Technological University, Singapore; guohl@ntu.edu.sg, zhangj@ntu.edu.sg |
| Pseudocode | No | The paper includes mathematical formulations and descriptions of the algorithm, but no clearly labeled 'Pseudocode' or 'Algorithm' blocks or figures. |
| Open Source Code | No | The paper does not contain any statement about providing open-source code or links to a code repository for the methodology described. |
| Open Datasets | Yes | One is the URL Reputation data set [Ma et al., 2009] from the UCI machine learning repository. The other one is the UJIIndoorLoc data set [Joaquin Torres-Sospedra, 2014]. For regression, the first data set is the Relative location of CT slices on axial axis data set [Graf et al., 2011] from the UCI machine learning repository. The second data set we choose is the Blog Feedback data set [Buza, 2014]. |
| Dataset Splits | Yes | We implement all the competing distributed machine learning algorithms in the same environmental context (i.e., the same number of machines, the same configuration of Hadoop and EC2, the same training and testing data set, the same cross validation scheme (10-fold cross validation))... |
| Hardware Specification | Yes | Experimental settings are as follows: Amazon instance category r3.large; number of instances: 5; operating system: Red Hat Enterprise Linux 7.1 (HVM), 64-bit; RAM: 15 GB; number of virtual CPUs per instance: 2 |
| Software Dependencies | No | The paper mentions 'Hadoop' as the distributed computation framework but does not provide its specific version number. While 'Red Hat Enterprise Linux 7.1 (HVM) 64-bit' includes a version for the operating system, the absence of a version for the critical 'Hadoop' framework means not all key software components have specified versions for full reproducibility. |
| Experiment Setup | No | The paper mentions using a '10-fold cross validation scheme' and the types of machine learning algorithms (SVM, linear regression), but it does not specify concrete hyperparameter values (e.g., learning rate, batch size, number of epochs) or detailed optimizer settings for the models trained. |
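The 10-fold cross-validation scheme cited in the Dataset Splits and Experiment Setup rows can be sketched as below. This is a generic illustration of the protocol, not the authors' code; the `k_fold_indices` helper is a hypothetical name, and the implementation assumes only that folds are disjoint, shuffled partitions of the sample indices.

```python
import numpy as np

def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Samples are shuffled once, split into k disjoint folds, and each
    fold serves as the test set exactly once.
    """
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    folds = np.array_split(indices, k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, test_idx

# Example: 10-fold split over 100 samples, as in a 10-fold CV scheme
splits = list(k_fold_indices(100, k=10))
```

Each of the 10 iterations trains on 90% of the data and tests on the held-out 10%, so every sample contributes to the test error exactly once.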