Distributed Pareto Optimization for Subset Selection

Authors: Chao Qian, Guiying Li, Chao Feng, Ke Tang

IJCAI 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our extensive experiments using Spark, on various real-world data sets with size ranging from thousands to millions, show that DPOSS can achieve competitive performance compared with the centralized POSS, and is almost always better than the state-of-the-art distributed greedy algorithm RANDGREEDI.
Researcher Affiliation | Academia | 1) Anhui Province Key Lab of Big Data Analysis and Application, University of Science and Technology of China, Hefei 230027, China; 2) Shenzhen Key Lab of Computational Intelligence, Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China. Emails: chaoqian@ustc.edu.cn, {lgy147, chaofeng}@mail.ustc.edu.cn, tangk3@sustc.edu.cn
Pseudocode | Yes | Algorithm 1 (POSS). Input: all items V = {v1, v2, ..., vn}, an objective function f : 2^V → ℝ and a budget k. Parameter: the number T of iterations. Output: a subset of V with at most k items. Process: 1: Let s = {0}^n and P = {s}. 2: Let t = 0. 3: while t < T do 4: Select s from P uniformly at random. 5: Generate s' by flipping each bit of s with prob. 1/n. 6: if ∄ z ∈ P such that z ≻ s' then 7: P = (P \ {z ∈ P | s' ⪰ z}) ∪ {s'}. 8: end if 9: t = t + 1. 10: end while 11: return arg max_{s ∈ P, |s| ≤ k} f(s). (A Python sketch of this procedure is given after the table.)
Open Source Code | No | The paper states that 'All of the experiments are coded in Python 2.7.14' but does not provide any link to the source code for DPOSS or explicitly state that it is made publicly available.
Open Datasets | Yes | The data sets are downloaded from http://fimi.ua.ac.be/data/, https://snap.stanford.edu/data/, https://turing.cs.hbg.psu.edu/txn131/vertex_cover.html and http://sites.nlsde.buaa.edu.cn/kexu/benchmarks/graph-benchmarks.htm, as well as from https://archive.ics.uci.edu/ml/datasets.html and https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
Dataset Splits | No | The paper describes how the ground set V is partitioned among m machines for distributed processing ('partition by assigning items uniformly randomly to the reducers'), but it does not specify explicit training, validation, or test dataset splits with percentages or sample counts for model evaluation.
Hardware Specification | No | The paper states that experiments were run on 'a cluster of 10 quad-core machines', but it does not provide specific hardware details such as CPU models, GPU models, or memory specifications.
Software Dependencies | Yes | All of the experiments are coded in Python 2.7.14 and run on an identical configuration: a cluster of 10 quad-core machines running Spark 2.0.2.
Experiment Setup | Yes | To implement DPOSS, each machine runs POSS for 2ek²N iterations, where N is the number of items allocated to that machine, as suggested in [Qian et al., 2015; 2017b]. The budget k is set to 8. For regular-scale data sets, the number m of reducers is set to {1, 2, ..., 10}; for large-scale data sets, m is set to 300. (A two-round distributed sketch is given after the table.)
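
For reference, below is a minimal Python sketch of the POSS procedure quoted in the Pseudocode row. It is not the authors' released code (none is linked from the paper); the function and helper names (poss, evaluate) and the bit-string encoding are illustrative assumptions, and the refinement of handling over-sized solutions in the first objective is omitted for brevity.

import random

def poss(items, f, k, T):
    """Sketch of POSS (Algorithm 1): items is the ground set V, f maps a list
    of items to a real value, k is the budget, T the number of iterations."""
    n = len(items)

    def evaluate(bits):
        # objective value of the subset encoded by the 0/1 string
        return f([items[i] for i, b in enumerate(bits) if b])

    s0 = tuple([0] * n)                 # s = {0}^n
    population = {s0: evaluate(s0)}     # archive P, mapping bit string -> f value

    for _ in range(T):
        s = random.choice(list(population))   # select s from P uniformly at random
        s_new = tuple((1 - b) if random.random() < 1.0 / n else b for b in s)  # flip each bit w.p. 1/n
        f_new, sz_new = evaluate(s_new), sum(s_new)

        # z dominates s_new if it is no worse on both objectives
        # (larger f, smaller size) and strictly better on at least one
        dominated = any(
            fz >= f_new and sum(z) <= sz_new and (fz > f_new or sum(z) < sz_new)
            for z, fz in population.items()
        )
        if not dominated:
            # drop archived solutions that s_new weakly dominates, then archive s_new
            population = {z: fz for z, fz in population.items()
                          if not (f_new >= fz and sz_new <= sum(z))}
            population[s_new] = f_new

    # return the best archived subset with at most k items (the empty set is always feasible)
    best = max((z for z in population if sum(z) <= k), key=lambda z: population[z])
    return [items[i] for i, b in enumerate(best) if b]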
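The experiment-setup row describes the distributed run: items are assigned uniformly at random to m reducers and each reducer runs POSS for 2ek²N iterations, where N is its local number of items. Below is a hedged PySpark sketch of that two-round pattern. It reuses the poss() helper sketched above; the merge step (running POSS once more on the union of the local solutions and returning the best candidate) follows the usual distributed greedy/POSS scheme and is an assumption of this sketch, not code quoted from the paper. The toy coverage objective in the usage block is illustrative only.

import math
import random
from pyspark import SparkContext

def dposs(sc, items, f, k, m):
    """Two-round distributed sketch: random partition over m reducers, local
    POSS for 2ek^2*N iterations each, then POSS on the union of local picks."""
    items = list(items)
    random.shuffle(items)                        # assign items to reducers uniformly at random
    partitions = sc.parallelize(items, numSlices=m)

    def local_poss(part):
        part = list(part)
        if not part:
            return []
        T = int(2 * math.e * k * k * len(part))  # 2ek^2 N, N = local partition size
        return [poss(part, f, k, T)]

    local_solutions = partitions.mapPartitions(local_poss).collect()

    # second round: run POSS centrally on the union of the locally selected items
    merged = [v for sol in local_solutions for v in sol]
    final = poss(merged, f, k, int(2 * math.e * k * k * len(merged)))

    # return the best among the local solutions and the merged one
    return max(local_solutions + [final], key=f)

if __name__ == "__main__":
    sc = SparkContext("local[4]", "dposs-sketch")
    # toy maximum-coverage objective over a small universe, for illustration only
    ground = [frozenset(random.sample(range(500), 15)) for _ in range(1000)]
    cover = lambda s: len(set().union(*s)) if s else 0
    print(dposs(sc, ground, cover, k=8, m=10))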