A Distributed and Scalable Machine Learning Approach for Big Data

Authors: Hongliang Guo, Jie Zhang

IJCAI 2016

Each reproducibility variable below is listed with its assessed result, followed by the LLM response giving the supporting evidence.
Research Type: Experimental
LLM Response: "Extensive experiments performed on Hadoop confirm that our proposed approach is superior in terms of both testing errors and convergence rate (computation time) over the canonical distributed machine learning techniques that deal with big data." "Experimental results on large data sets verify that our approach achieves better convergence rate than four other canonical distributed or big data machine learning algorithms, on both regression and classification tasks."
Researcher Affiliation: Academia
LLM Response: "Hongliang Guo and Jie Zhang, School of Computer Science and Engineering, Nanyang Technological University, Singapore (guohl@ntu.edu.sg, zhangj@ntu.edu.sg)"
Pseudocode: No
LLM Response: The paper includes mathematical formulations and descriptions of the algorithm, but no clearly labeled "Pseudocode" or "Algorithm" blocks or figures.
Open Source Code: No
LLM Response: The paper makes no statement about releasing code and provides no link to a repository for the described methodology.
Open Datasets: Yes
LLM Response: "One is the URL Reputation data set [Ma et al., 2009] from the UCI machine learning repository. The other one is the UJIIndoorLoc data set [Joaquin Torres-Sospedra, 2014]. For regression, the first data set is the Relative location of CT slices on axial axis data set [Graf et al., 2011] from the UCI machine learning repository. The second data set we choose is the Blog Feedback data set [Buza, 2014]."
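To make the data dependencies concrete, here is a minimal loading sketch for two of the cited datasets. The file names follow the UCI archives as commonly distributed, but treat paths and preprocessing as assumptions, since the paper does not describe its data pipeline.

```python
# Minimal, hypothetical loading sketch for two of the cited public datasets.
# Download the archives from the UCI Machine Learning Repository first;
# the file names below follow the distributed archives but are assumptions.
import pandas as pd

# Blog Feedback [Buza, 2014]: the last column is the regression target.
blog = pd.read_csv("blogData_train.csv", header=None)
X_blog, y_blog = blog.iloc[:, :-1].to_numpy(), blog.iloc[:, -1].to_numpy()

# UJIIndoorLoc [2014]: WiFi fingerprint features (WAP001..WAP520 columns)
# plus location/label columns used for the classification task.
uji = pd.read_csv("trainingData.csv")
X_uji = uji.filter(like="WAP").to_numpy()
```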
Dataset Splits: Yes
LLM Response: "We implement all the competing distributed machine learning algorithms in the same environmental context (i.e., the same number of machines, the same configuration of Hadoop and EC2, the same training and testing data set, the same cross validation scheme (10-fold cross validation))..."
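For illustration, a minimal 10-fold cross-validation sketch with scikit-learn; the estimator, scoring metric, and synthetic data are assumptions, as the paper's own distributed Hadoop implementation is not available.

```python
# Minimal 10-fold cross-validation sketch (illustrative only; it does not
# reproduce the paper's distributed implementation on Hadoop).
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge  # placeholder regressor

X, y = np.random.rand(1000, 20), np.random.rand(1000)  # stand-in data
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(Ridge(), X, y, cv=cv,
                         scoring="neg_mean_squared_error")
print(f"10-fold MSE: {-scores.mean():.4f} +/- {scores.std():.4f}")
```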
Hardware Specification: Yes
LLM Response: "Experimental settings are as follows: Amazon instance category r3.large; number of instances 5; operating system Red Hat Enterprise Linux 7.1 (HVM), 64-bit; RAM 15 GB; number of virtual CPUs for each instance 2"
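Purely as a hypothetical sketch, the stated cluster could be provisioned with boto3 as below; the AMI ID and region are placeholders, and no such script appears in the paper.

```python
# Hypothetical provisioning sketch matching the stated hardware
# (5 x r3.large, RHEL 7.1 HVM 64-bit). The AMI ID and region are placeholders.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # region assumed
resp = ec2.run_instances(
    ImageId="ami-XXXXXXXX",   # placeholder: RHEL 7.1 (HVM), 64-bit AMI
    InstanceType="r3.large",  # 2 vCPUs, ~15 GB RAM per instance
    MinCount=5,
    MaxCount=5,
)
print([i["InstanceId"] for i in resp["Instances"]])
```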
Software Dependencies: No
LLM Response: The paper names Hadoop as the distributed computation framework but does not give its version. The operating system is versioned (Red Hat Enterprise Linux 7.1), but without a Hadoop version the key software stack is not fully pinned for reproducibility.
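When rerunning the experiments, the missing version information could be recorded with a small script like the following, assuming the hadoop CLI is on each node's PATH.

```python
# Illustrative: capture framework/OS versions on each node so reruns can
# pin the environment that the paper leaves unspecified.
import platform
import subprocess

hadoop = subprocess.run(["hadoop", "version"],
                        capture_output=True, text=True)
print(hadoop.stdout.splitlines()[0])  # e.g. "Hadoop 2.x.y"
print(platform.platform())            # OS/kernel identification
```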
Experiment Setup: No
LLM Response: The paper mentions a 10-fold cross-validation scheme and the model families used (SVM, linear regression), but it reports no concrete hyperparameter values (e.g., learning rate, batch size, number of epochs) and no optimizer settings for the trained models.
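For contrast, a hedged sketch of what a fully specified model setup could look like; every hyperparameter value below is an assumption and none is taken from the paper.

```python
# Hypothetical fully specified configuration; none of these values come
# from the paper, which omits hyperparameters entirely.
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDRegressor

clf = LinearSVC(C=1.0, max_iter=1000, tol=1e-4)  # classification (assumed)
reg = SGDRegressor(learning_rate="constant", eta0=0.01,  # regression (assumed)
                   max_iter=100, penalty="l2", alpha=1e-4)
```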