Communication Efficient Distributed Machine Learning with the Parameter Server

Authors: Mu Li, David G. Andersen, Alexander J. Smola, Kai Yu

NeurIPS 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present an in-depth analysis of two large scale machine learning problems ranging from ℓ1-regularized logistic regression on CPUs to reconstruction ICA on GPUs, using 636TB of real data with hundreds of billions of samples and dimensions. We demonstrate using these examples that the parameter server framework is an effective and straightforward way to scale machine learning to larger problems and systems than have been previously achieved.
Researcher Affiliation | Collaboration | Mu Li, David G. Andersen, Alexander Smola, and Kai Yu; Carnegie Mellon University, Baidu, Google; {muli, dga}@cs.cmu.edu, alex@smola.org, yukai@baidu.com
Pseudocode | Yes | Algorithm 1 Distributed Subgradient Descent Solving (1) in the Parameter Server
Open Source Code | Yes | Finally, the source codes are available at http://parameterserver.org.
Open Datasets | No | We collected an ad click prediction dataset with 170 billion samples and 65 billion unique features. The uncompressed dataset size is 636TB.
Dataset Splits | No | The paper mentions data partitioning for distributed processing (e.g., 'training data is partitioned and distributed among all the workers'), but it does not specify train/validation/test dataset splits, percentages, or methodology for reproducibility.
Hardware Specification | Yes | We ran the parameter server on 1000 machines, each with 16 CPU cores, 192GB DRAM, and connected by 10 Gb Ethernet.
Software Dependencies | No | The paper mentions several related systems and frameworks (e.g., Hadoop, Spark, Mahout, Graphlab) in its background, but it does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x) required to replicate their experimental setup or run their code.
Experiment Setup | Yes | We adopted Algorithm 2 with upper bounds of the diagonal entries of the Hessian as the coordinate-specific learning rates. Features were randomly split into 580 blocks according to the feature group information. We chose a fixed learning rate by observing the convergence speed.
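
For context on the ℓ1-regularized logistic regression task quoted in the Research Type row, the objective usually referred to as problem (1) in this line of work takes the standard form below. The notation (n samples x_i with labels y_i in {-1, +1}, weight vector w, regularization weight λ) is the conventional one and is not copied verbatim from the paper.

```latex
% l1-regularized logistic regression in its standard form
% (conventional notation; the paper states its problem (1) in its own notation)
\min_{w \in \mathbb{R}^d} \;
  \sum_{i=1}^{n} \log\!\bigl(1 + \exp\bigl(-y_i \langle x_i, w \rangle\bigr)\bigr)
  \;+\; \lambda \lVert w \rVert_1
```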
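
The Pseudocode row cites Algorithm 1, Distributed Subgradient Descent in the Parameter Server. Below is a minimal single-process sketch of that worker/server pattern: workers pull the coordinates their data touches, compute local subgradients, and push them back, while the server applies the updates. The class and method names (Server, Worker, pull, push) are illustrative placeholders, not the API of the released code at http://parameterserver.org.

```python
import numpy as np

# Minimal single-process sketch of distributed subgradient descent with a
# parameter server, in the spirit of Algorithm 1 of the paper. Names and
# structure are illustrative only.

class Server:
    """Holds the global weight vector and applies pushed (sub)gradients."""
    def __init__(self, dim, lr=0.1):
        self.w = np.zeros(dim)
        self.lr = lr

    def pull(self, keys):
        # Workers request only the coordinates they need.
        return self.w[keys]

    def push(self, keys, grad):
        # Subgradient step on the pushed coordinates.
        self.w[keys] -= self.lr * grad

class Worker:
    """Owns a partition of the training data and computes local subgradients."""
    def __init__(self, X, y, reg=1e-4):
        self.X, self.y, self.reg = X, y, reg
        # All coordinates here; a real system would pull only the keys that
        # actually appear in the local (sparse) data.
        self.keys = np.arange(X.shape[1])

    def step(self, server):
        w = server.pull(self.keys)
        margin = self.y * (self.X @ w)
        # Subgradient of logistic loss plus an l1 term on the local partition.
        coef = -self.y / (1.0 + np.exp(margin))
        grad = self.X.T @ coef / len(self.y) + self.reg * np.sign(w)
        server.push(self.keys, grad)

# Usage: partition a toy dataset across "workers" and iterate.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.choice([-1.0, 1.0], size=200)
server = Server(dim=10)
workers = [Worker(X[i::4], y[i::4]) for i in range(4)]
for _ in range(50):
    for wk in workers:
        wk.step(server)
```

In the actual system the key space is sharded across many server nodes and workers push and pull asynchronously over the network; none of that machinery is modeled in this sketch.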
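
The Experiment Setup row mentions using upper bounds of the diagonal Hessian entries as coordinate-specific learning rates. For logistic loss this bound has a simple closed form, since the sigmoid's derivative never exceeds 1/4; the step-size scaling η_j shown below is one standard way to use such a bound and is offered as an interpretation, not as the paper's exact rule.

```latex
% Diagonal Hessian entries of the logistic loss and a standard upper bound,
% with sigma(z) = 1 / (1 + e^{-z}) and sigma(z)(1 - sigma(z)) <= 1/4.
H_{jj} = \sum_{i=1}^{n} x_{ij}^{2}\,
         \sigma\bigl(x_i^{\top} w\bigr)\bigl(1 - \sigma\bigl(x_i^{\top} w\bigr)\bigr)
       \;\le\; \frac{1}{4}\sum_{i=1}^{n} x_{ij}^{2} =: U_j,
\qquad
\eta_j \propto \frac{1}{U_j}
```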