Post: Device Placement with Cross-Entropy Minimization and Proximal Policy Optimization

Authors: Yuanxiang Gao, Li Chen, Baochun Li

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have implemented Post in the Google Cloud platform, and our extensive experiments with several popular neural network training benchmarks have demonstrated clear evidence of superior performance: with the same amount of learning time, it leads to placements that have training times up to 63.7% shorter over the state-of-the-art.
Researcher Affiliation | Academia | Yuanxiang Gao (1,2), Li Chen (3), Baochun Li (1). 1: Department of Electrical and Computer Engineering, University of Toronto; 2: School of Information and Communication Engineering, University of Electronic Science and Technology of China; 3: School of Computing and Informatics, University of Louisiana at Lafayette. yuanxiang@ece.utoronto.ca, li.chen@louisiana.edu, bli@ece.toronto.edu
Pseudocode | Yes | Algorithm 1 (Post: Joint Policy Optimization):
1: Initialize parameters u^(0) as all zeros; initialize t = 0
2: for n = 1, 2, ..., L do
3:   Sample a placement d^(n) ~ f(d | u^(t))
4:   Train the DNN under d^(n) and record T(d^(n))
5:   if n % K == 0 and n % N != 0 then
6:     Perform several (e.g., 10) stochastic gradient ascent steps w.r.t. the objective of proximal policy optimization in Eq. (11)
7:     t = t + 1
8:   end if
9:   if n % N == 0 then
10:    Solve the cross-entropy minimization using Eq. (10) to achieve a global minimum
11:    t = t + 1
12:    Mix the new distribution: f(d_m | u_m^(t+1)) = (1 - ϵ) f(d_m | u_m^(t+1)) + ϵ · (1/D), ∀m
13:  end if
14: end for
(A hedged NumPy sketch of this loop appears after the table.)
Open Source Code | No | The paper does not provide a direct link or an explicit statement about the availability of the source code for the Post algorithm.
Open Datasets | Yes | Inception-V3 and ResNet are two popular deep convolutional neural networks trained on the ImageNet dataset using a batch size of 32. RNNLM is a 4-layer RNN trained with a batch size of 64. NMT is a sequence-to-sequence encoder-decoder architecture trained on the WMT16 dataset with a batch size of 64.
Dataset Splits | No | The paper does not explicitly mention the use of a validation dataset split or provide details on how data was partitioned into training, validation, and test sets. It describes training for a number of steps to measure performance, but not data splits for model evaluation.
Hardware Specification | Yes | We have conducted our experiments with 12 machines on the Google Cloud platform. Each machine is equipped with 26 GB of main memory, an Intel Broadwell 8-core CPU and 2, 4 or 8 NVIDIA Tesla K80 GPUs, each with 11 GB of memory.
Software Dependencies | No | The paper mentions using TensorFlow for the benchmarks but does not specify its version number or any other software dependencies with version details.
Experiment Setup | Yes | Our detailed setting of the parameters in Algorithm 1 is as follows: the learning rate of SGA is 1; the ratio for choosing promising placements is 0.1 (6 best placements out of 60 samples over 5 iterations); the exploration factor ϵ is 0.1, which is linearly reduced to zero during learning; the KL penalty coefficient β is initialized as 1 and adapted based on a target KL divergence of 0.03. (These settings are collected in the configuration snippet after the table.)
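
As flagged in the Pseudocode row, here is a minimal NumPy sketch of Algorithm 1's control flow. It is a sketch under assumptions, not the authors' implementation: the placement policy is assumed to be a factorized per-operation softmax, `measure_training_time` is a stub standing in for an actual TensorFlow training step, the problem sizes `M`, `D`, `L`, `K`, `N` are illustrative, the PPO objective of Eq. (11) is replaced by a plain policy-gradient step with a moving baseline, and Eq. (10) is replaced by an elite-frequency refit. Only the learning rate of 1, the 0.1 elite ratio, and the linearly annealed ϵ = 0.1 are taken from the paper.

```python
import numpy as np

M, D = 50, 4          # number of operation groups and devices (illustrative)
L, K, N = 60, 4, 12   # total samples, PPO period, cross-entropy period (illustrative)
EPS = 0.1             # exploration factor from the paper, linearly annealed to zero
LR = 1.0              # SGA learning rate reported in the paper

u = np.zeros((M, D))  # softmax parameters u^(0), initialized to zeros


def policy(u):
    """Per-operation softmax distribution f(d_m | u_m)."""
    z = np.exp(u - u.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)


def measure_training_time(placement):
    """Stub: run one training step under `placement` and return its time T(d)."""
    return 1.0 + 0.1 * np.random.rand()  # placeholder measurement


samples, times = [], []
for n in range(1, L + 1):
    probs = policy(u)
    # Sample a placement d^(n) ~ f(d | u^(t)) and measure its training time.
    d = np.array([np.random.choice(D, p=probs[m]) for m in range(M)])
    samples.append(d)
    times.append(measure_training_time(d))

    if n % K == 0 and n % N != 0:
        # Stand-in for Eq. (11): several SGA steps that increase the
        # log-probability of faster-than-average recent placements.
        baseline = np.mean(times[-K:])
        for _ in range(10):
            grad = np.zeros_like(u)
            for d_i, t_i in zip(samples[-K:], times[-K:]):
                adv = baseline - t_i                      # faster => positive advantage
                grad += adv * (np.eye(D)[d_i] - policy(u))
            u += LR * grad / K

    if n % N == 0:
        # Stand-in for Eq. (10): refit the softmax to the top 10% fastest
        # placements seen so far, then mix with the uniform distribution.
        elite_count = max(1, len(times) // 10)
        elite_idx = np.argsort(times)[:elite_count]
        elite = np.stack([samples[i] for i in elite_idx])
        counts = np.stack([np.bincount(elite[:, m], minlength=D) for m in range(M)])
        freq = (counts + 1e-6) / (counts.sum(axis=1, keepdims=True) + D * 1e-6)
        eps = EPS * (1.0 - n / L)                         # linearly annealed exploration
        mixed = (1.0 - eps) * freq + eps / D              # epsilon-mixing with uniform
        u = np.log(mixed)                                 # parameters realizing the mixed distribution
```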
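
The Experiment Setup row lists the remaining hyperparameters. The snippet below simply collects them into one configuration and adds, as an assumption, the standard adaptive KL-penalty rule from the original PPO paper as one plausible reading of "adapted based on a target KL divergence of 0.03"; the excerpt does not state the exact adaptation rule used by Post.

```python
# Hyperparameters reported in the paper, gathered into a single config.
POST_CONFIG = {
    "sga_learning_rate": 1.0,  # learning rate of stochastic gradient ascent
    "elite_ratio": 0.1,        # 6 best placements out of 60 samples over 5 iterations
    "epsilon_init": 0.1,       # exploration factor, linearly reduced to zero
    "kl_penalty_beta": 1.0,    # initial KL penalty coefficient beta
    "kl_target": 0.03,         # target KL divergence used to adapt beta
}


def adapt_beta(beta, observed_kl, target=POST_CONFIG["kl_target"]):
    """Adaptive-KL rule from Schulman et al. (2017); assumed here, not quoted from Post."""
    if observed_kl > 1.5 * target:
        return beta * 2.0
    if observed_kl < target / 1.5:
        return beta / 2.0
    return beta
```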