A Distributed and Scalable Machine Learning Approach for Big Data

Authors: Hongliang Guo, Jie Zhang

IJCAI 2016

Each reproducibility variable below is listed with its assessed result, followed by the LLM response giving the supporting evidence.
Research Type: Experimental
LLM Response: "Extensive experiments performed on Hadoop confirm that our proposed approach is superior in terms of both testing errors and convergence rate (computation time) over the canonical distributed machine learning techniques that deal with big data." "Experimental results on large data sets verify that our approach achieves better convergence rate than four other canonical distributed or big data machine learning algorithms, on both regression and classification tasks."
Researcher Affiliation: Academia
LLM Response: "Hongliang Guo and Jie Zhang, School of Computer Science and Engineering, Nanyang Technological University, Singapore (guohl@ntu.edu.sg, zhangj@ntu.edu.sg)"
Pseudocode: No
LLM Response: The paper includes mathematical formulations and descriptions of the algorithm, but no clearly labeled "Pseudocode" or "Algorithm" blocks or figures.
Open Source Code: No
LLM Response: The paper makes no statement about releasing code and provides no link to a repository for the described methodology.
Open Datasets: Yes
LLM Response: "One is the URL Reputation data set [Ma et al., 2009] from the UCI machine learning repository. The other one is the UJIIndoorLoc data set [Joaquin Torres-Sospedra, 2014]. For regression, the first data set is the Relative location of CT slices on axial axis data set [Graf et al., 2011] from the UCI machine learning repository. The second data set we choose is the Blog Feedback data set [Buza, 2014]."
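To make the data dependencies concrete, here is a minimal loading sketch for two of the cited datasets. The file names follow the UCI archives as commonly distributed, but treat paths and preprocessing as assumptions, since the paper does not describe its data pipeline.

```python
# Minimal, hypothetical loading sketch for two of the cited public datasets.
# Download the archives from the UCI Machine Learning Repository first;
# the file names below follow the distributed archives but are assumptions.
import pandas as pd

# Blog Feedback [Buza, 2014]: the last column is the regression target.
blog = pd.read_csv("blogData_train.csv", header=None)
X_blog, y_blog = blog.iloc[:, :-1].to_numpy(), blog.iloc[:, -1].to_numpy()

# UJIIndoorLoc [2014]: WiFi fingerprint features (WAP001..WAP520 columns)
# plus location/label columns used for the classification task.
uji = pd.read_csv("trainingData.csv")
X_uji = uji.filter(like="WAP").to_numpy()
```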
Dataset Splits: Yes
LLM Response: "We implement all the competing distributed machine learning algorithms in the same environmental context (i.e., the same number of machines, the same configuration of Hadoop and EC2, the same training and testing data set, the same cross validation scheme (10-fold cross validation))..."
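For illustration, a minimal 10-fold cross-validation sketch with scikit-learn; the estimator, scoring metric, and synthetic data are assumptions, as the paper's own distributed Hadoop implementation is not available.

```python
# Minimal 10-fold cross-validation sketch (illustrative only; it does not
# reproduce the paper's distributed implementation on Hadoop).
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge  # placeholder regressor

X, y = np.random.rand(1000, 20), np.random.rand(1000)  # stand-in data
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print(f"10-fold MSE: {-scores.mean():.4f} +/- {scores.std():.4f}")
```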
Hardware Specification: Yes
LLM Response: "Experimental settings are as follows: Amazon instance category r3.large; number of instances 5; operating system Red Hat Enterprise Linux 7.1 (HVM), 64-bit; RAM 15 GB; number of virtual CPUs for each instance 2"
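Purely as a hypothetical sketch, the stated cluster could be provisioned with boto3 as below; the AMI ID and region are placeholders, and no such script appears in the paper.

```python
# Hypothetical provisioning sketch matching the stated hardware
# (5 x r3.large, RHEL 7.1 HVM 64-bit). The AMI ID and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region assumed
resp = ec2.run_instances(
    ImageId="ami-XXXXXXXX",   # placeholder: RHEL 7.1 (HVM), 64-bit AMI
    InstanceType="r3.large",  # 2 vCPUs, ~15 GB RAM per instance
    MinCount=5,
    MaxCount=5,
)
print([i["InstanceId"] for i in resp["Instances"]])
```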
Software Dependencies: No
LLM Response: The paper names Hadoop as the distributed computation framework but does not give its version. The operating system is versioned (Red Hat Enterprise Linux 7.1), but without a Hadoop version the key software stack is not fully pinned for reproducibility.
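When rerunning the experiments, the missing version information could be recorded with a small script like the following, assuming the hadoop CLI is on each node's PATH.

```python
# Illustrative: capture framework/OS versions on each node so reruns can
# pin the environment that the paper leaves unspecified.
import platform
import subprocess

hadoop = subprocess.run(["hadoop", "version"],
                        capture_output=True, text=True)
print(hadoop.stdout.splitlines()[0])  # e.g. "Hadoop 2.x.y"
print(platform.platform())            # OS/kernel identification
```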
Experiment Setup: No
LLM Response: The paper mentions a 10-fold cross-validation scheme and the model families used (SVM, linear regression), but it reports no concrete hyperparameter values (e.g., learning rate, batch size, number of epochs) and no optimizer settings for the trained models.
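For contrast, a hedged sketch of what a fully specified model setup could look like; every hyperparameter value below is an assumption and none is taken from the paper.

```python
# Hypothetical fully specified configuration; none of these values come
# from the paper, which omits hyperparameters entirely.
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDRegressor

clf = LinearSVC(C=1.0, max_iter=1000, tol=1e-4)  # classification (assumed)
reg = SGDRegressor(learning_rate="constant", eta0=0.01,  # regression (assumed)
                   max_iter=100, penalty="l2", alpha=1e-4)
```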