Cheaper and Faster: Distributed Deep Reinforcement Learning with Serverless Computing

Authors: Hanfei Yu, Jian Li, Yang Hua, Xu Yuan, Hao Wang

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments show that MINIONSRL reduces total training time by up to 52% and training cost by 86% compared to latest solutions.
Researcher Affiliation Academia Hanfei Yu¹, Jian Li², Yang Hua³, Xu Yuan⁴, Hao Wang¹. ¹Louisiana State University, ²Stony Brook University, ³Queen's University Belfast, ⁴University of Delaware. {hyu25, haowang}@lsu.edu, jian.li.3@stonybrook.edu, Y.Hua@qub.ac.uk, xyuan@udel.edu
Pseudocode No The paper does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks.
Open Source Code No The paper does not provide an explicit statement that the code for the described methodology is open-sourced, nor a link to a code repository.
Open Datasets Yes Six environments from OpenAI Gym are used to evaluate MINIONSRL and other baselines, including three continuous-action MuJoCo environments (Hopper-v3, Humanoid-v3, and HalfCheetah-v3) and three discrete-action Atari environments (SpaceInvadersNoFrameskip-v4, QbertNoFrameskip-v4, and GravitarNoFrameskip-v4).
Dataset Splits No The paper mentions using 'Six environments from OpenAI Gym' but does not specify explicit training, validation, or test dataset splits (e.g., percentages or counts) or reference predefined splits for reproducibility.
Hardware Specification Yes We deploy all server-based baselines to a cluster of Azure VMs: one Standard NC6s v3 virtual machine (VM) and four Standard E16-8s v5 VMs. The cluster contains one NVIDIA V100 GPU and four 8-core Intel Xeon Platinum CPUs (in total 32 cores) for training DRL workloads. MINIONSRL is prototyped on Azure Container Instances (ACI) (Azure Container Instances 2022). When training DRL workloads with MINIONSRL, according to our workload profiling, each learner container is configured with one V100 GPU and each actor container is with one CPU core, respectively.
Software Dependencies No The paper mentions several software components like Ray library, Ray-RLlib, Ray-Tune, PPO algorithm, Adam optimizer, and gRPC, but does not specify their version numbers.
Experiment Setup Yes Table 1: Hyperparameters of PPO used in the training workloads and the search ranges of the scheduler. Learning rate: 0.00005; Discount factor (γ): 0.99; Mini-batch size: 256; Clip parameter: 0.3; KL coefficient: 0.2; KL target: 0.01; Entropy coefficient: 0.0; Value function coefficient: 1.0. For MuJoCo, the policy network consists of two fully-connected layers of 256 hidden units with Tanh activation. For Atari, the policy network consists of three convolutional layers with 8×8, 4×4, and 11×11 kernel sizes and ReLU activation, respectively.
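
For concreteness, the six OpenAI Gym environments listed under Open Datasets can be instantiated as in the following minimal sketch. It assumes the classic `gym` package with the MuJoCo and Atari extras installed; the paper does not pin library versions, so the environment IDs may resolve differently under newer `gymnasium` releases.

```python
# Minimal sketch: instantiate the six evaluation environments named in the paper.
# Requires classic OpenAI Gym with MuJoCo (mujoco-py) and Atari ROM support;
# the exact versions are an assumption, since the paper does not specify them.
import gym

ENV_IDS = [
    # Continuous-action MuJoCo tasks
    "Hopper-v3",
    "Humanoid-v3",
    "HalfCheetah-v3",
    # Discrete-action Atari tasks
    "SpaceInvadersNoFrameskip-v4",
    "QbertNoFrameskip-v4",
    "GravitarNoFrameskip-v4",
]

for env_id in ENV_IDS:
    env = gym.make(env_id)
    print(env_id, env.observation_space, env.action_space)
    env.close()
```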
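
The Experiment Setup row above quotes the PPO hyperparameters of Table 1 and the policy-network shapes. A hedged sketch of how those settings map onto a Ray RLlib-style PPO configuration is shown below; the dictionary keys follow RLlib's PPO config, but the RLlib version is not stated in the paper, and the convolutional channel counts and strides are illustrative assumptions because the paper reports only kernel sizes.

```python
# Sketch: Table 1's PPO hyperparameters expressed as an RLlib-style config dict.
# Key names follow Ray RLlib's PPO settings; library versions are unspecified in
# the paper, so exact names and defaults may differ across releases.
from ray import tune

PPO_CONFIG = {
    "lr": 5e-5,                 # Learning rate
    "gamma": 0.99,              # Discount factor (γ)
    "sgd_minibatch_size": 256,  # Mini-batch size
    "clip_param": 0.3,          # Clip parameter
    "kl_coeff": 0.2,            # KL coefficient
    "kl_target": 0.01,          # KL target
    "entropy_coeff": 0.0,       # Entropy coefficient
    "vf_loss_coeff": 1.0,       # Value function coefficient
}

# Policy networks as described in the paper.
MUJOCO_MODEL = {
    "fcnet_hiddens": [256, 256],    # two fully-connected layers of 256 units
    "fcnet_activation": "tanh",
}
ATARI_MODEL = {
    # [num_filters, kernel, stride] per conv layer; kernel sizes come from the
    # paper, while filter counts and strides are placeholder assumptions.
    "conv_filters": [[16, [8, 8], 4], [32, [4, 4], 2], [256, [11, 11], 1]],
    "conv_activation": "relu",
}

if __name__ == "__main__":
    # Example: train PPO briefly on one MuJoCo task with these settings.
    tune.run(
        "PPO",
        config={"env": "Hopper-v3", "model": MUJOCO_MODEL, **PPO_CONFIG},
        stop={"training_iteration": 10},
    )
```

Swapping in one of the Atari IDs together with ATARI_MODEL exercises the discrete-action configuration; the actor/learner scaling performed by MINIONSRL's serverless scheduler is outside the scope of this sketch.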