Distributed Pareto Optimization for Subset Selection

Authors: Chao Qian, Guiying Li, Chao Feng, Ke Tang

IJCAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments using Spark, on various real-world data sets with size ranging from thousands to millions, show that DPOSS can achieve competitive performance compared with the centralized POSS, and is almost always better than the state-of-the-art distributed greedy algorithm RANDGREEDI.
Researcher Affiliation | Academia | 1) Anhui Province Key Lab of Big Data Analysis and Application, University of Science and Technology of China, Hefei 230027, China; 2) Shenzhen Key Lab of Computational Intelligence, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China. Emails: chaoqian@ustc.edu.cn, {lgy147, chaofeng}@mail.ustc.edu.cn, tangk3@sustc.edu.cn
Pseudocode | Yes | Algorithm 1 (POSS). Input: all items V = {v1, v2, ..., vn}, an objective function f : 2^V → ℝ and a budget k. Parameter: the number T of iterations. Output: a subset of V with at most k items. Process: 1: Let s = {0}^n and P = {s}. 2: Let t = 0. 3: while t < T do 4: Select s from P uniformly at random. 5: Generate s' by flipping each bit of s with prob. 1/n. 6: if ∄ z ∈ P such that z ≻ s' then 7: P = (P \ {z ∈ P | s' ⪰ z}) ∪ {s'}. 8: end if 9: t = t + 1. 10: end while 11: return arg max_{s ∈ P, |s| ≤ k} f(s). (A Python sketch of this procedure is given after the table.)
Open Source Code | No | The paper states that 'All of the experiments are coded in Python 2.7.14' but does not provide any link to the source code for DPOSS or explicitly state that it is made publicly available.
Open Datasets | Yes | The data sets are downloaded from http://fimi.ua.ac.be/data/, https://snap.stanford.edu/data/, https://turing.cs.hbg.psu.edu/txn131/vertex_cover.html and http://sites.nlsde.buaa.edu.cn/kexu/benchmarks/graph-benchmarks.htm, as well as from https://archive.ics.uci.edu/ml/datasets.html and https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
Dataset Splits | No | The paper describes how the ground set V is partitioned among m machines for distributed processing ('partition by assigning items uniformly randomly to the reducers'), but it does not specify explicit training, validation, or test dataset splits with percentages or sample counts for model evaluation.
Hardware Specification | No | The paper states that experiments were run on 'a cluster of 10 quad-core machines', but it does not provide specific hardware details such as CPU models, GPU models, or memory specifications.
Software Dependencies | Yes | All of the experiments are coded in Python 2.7.14 and run on an identical configuration: a cluster of 10 quad-core machines running Spark 2.0.2.
Experiment Setup | Yes | To implement DPOSS, each machine runs POSS for 2ek²N iterations, where N is the number of items allocated to that machine, as suggested in [Qian et al., 2015; 2017b]. The budget k is set to 8. For regular-scale data sets, the number m of reducers is set to {1, 2, ..., 10}; for large-scale data sets, m is set to 300. (A two-round distributed sketch is given after the table.)
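
For reference, below is a minimal Python sketch of the POSS procedure quoted in the Pseudocode row. It is not the authors' released code (none is linked from the paper); the function and helper names (poss, evaluate) and the bit-string encoding are illustrative assumptions, and the refinement of handling over-sized solutions in the first objective is omitted for brevity.

import random

def poss(items, f, k, T):
    """Sketch of POSS (Algorithm 1): items is the ground set V, f maps a list
    of items to a real value, k is the budget, T the number of iterations."""
    n = len(items)

    def evaluate(bits):
        # objective value of the subset encoded by the 0/1 string
        return f([items[i] for i, b in enumerate(bits) if b])

    s0 = tuple([0] * n)                 # s = {0}^n
    population = {s0: evaluate(s0)}     # archive P, mapping bit string -> f value

    for _ in range(T):
        s = random.choice(list(population))   # select s from P uniformly at random
        s_new = tuple((1 - b) if random.random() < 1.0 / n else b for b in s)  # flip each bit w.p. 1/n
        f_new, sz_new = evaluate(s_new), sum(s_new)

        # z dominates s_new if it is no worse on both objectives
        # (larger f, smaller size) and strictly better on at least one
        dominated = any(
            fz >= f_new and sum(z) <= sz_new and (fz > f_new or sum(z) < sz_new)
            for z, fz in population.items()
        )
        if not dominated:
            # drop archived solutions that s_new weakly dominates, then archive s_new
            population = {z: fz for z, fz in population.items()
                          if not (f_new >= fz and sz_new <= sum(z))}
            population[s_new] = f_new

    # return the best archived subset with at most k items (the empty set is always feasible)
    best = max((z for z in population if sum(z) <= k), key=lambda z: population[z])
    return [items[i] for i, b in enumerate(best) if b]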
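The experiment-setup row describes the distributed run: items are assigned uniformly at random to m reducers and each reducer runs POSS for 2ek²N iterations, where N is its local number of items. Below is a hedged PySpark sketch of that two-round pattern. It reuses the poss() helper sketched above; the merge step (running POSS once more on the union of the local solutions and returning the best candidate) follows the usual distributed greedy/POSS scheme and is an assumption of this sketch, not code quoted from the paper. The toy coverage objective in the usage block is illustrative only.

import math
import random
from pyspark import SparkContext

def dposs(sc, items, f, k, m):
    """Two-round distributed sketch: random partition over m reducers, local
    POSS for 2ek^2*N iterations each, then POSS on the union of local picks."""
    items = list(items)
    random.shuffle(items)                        # assign items to reducers uniformly at random
    partitions = sc.parallelize(items, numSlices=m)

    def local_poss(part):
        part = list(part)
        if not part:
            return []
        T = int(2 * math.e * k * k * len(part))  # 2ek^2 N, N = local partition size
        return [poss(part, f, k, T)]

    local_solutions = partitions.mapPartitions(local_poss).collect()

    # second round: run POSS centrally on the union of the locally selected items
    merged = [v for sol in local_solutions for v in sol]
    final = poss(merged, f, k, int(2 * math.e * k * k * len(merged)))

    # return the best among the local solutions and the merged one
    return max(local_solutions + [final], key=f)

if __name__ == "__main__":
    sc = SparkContext("local[4]", "dposs-sketch")
    # toy maximum-coverage objective over a small universe, for illustration only
    ground = [frozenset(random.sample(range(500), 15)) for _ in range(1000)]
    cover = lambda s: len(set().union(*s)) if s else 0
    print(dposs(sc, ground, cover, k=8, m=10))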