Variance Reduction for Reinforcement Learning in Input-Driven Environments

Authors: Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experimental results show that across environments from queuing systems, computer networks, and MuJoCo robotic locomotion, input-dependent baselines consistently improve training stability and result in better eventual policies."
Researcher Affiliation | Academia | "Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh, MIT Computer Science and Artificial Intelligence Laboratory, {hongzi,bjjvnkt,malte,alizadeh}@csail.mit.edu"
Pseudocode | Yes | "Algorithm 1 Training a meta input-dependent baseline for policy-based methods." (A sketch of this training loop follows the table.)
Open Source Code | No | The paper references third-party open-source libraries (e.g., OpenAI Baselines) but does not provide a link or statement confirming that the source code for their specific methodology (input-dependent baselines and the meta-learning approach) is publicly available.
Open Datasets | Yes | "We simulate real-world video streaming using public cellular network data (Riiser et al., 2013)"
Dataset Splits | No | The paper mentions evaluating on "100 unseen testing input sequences" but does not specify explicit training, validation, and test dataset splits (e.g., percentages or exact counts) for the overall data partitioning.
Hardware Specification | No | The paper mentions using the MuJoCo physics engine and OpenAI Gym but does not provide specific hardware details such as GPU or CPU models, memory, or cloud instance specifications used for running the experiments.
Software Dependencies | Yes | "We use the MuJoCo physics engine (Todorov et al., 2012) in OpenAI Gym (Brockman et al., 2016)... OpenAI Baselines. https://github.com/openai/baselines, 2017... The activation function is ReLU (Nair & Hinton, 2010) and the optimizer is Adam (Chilimbi et al., 2014)." (A minimal check of this software stack follows the table.)
Experiment Setup | Yes | "We use γ = 0.995 for both environments. The actor and the critic networks have 2 hidden layers, with 64 and 32 hidden neurons on each. The activation function is ReLU (Nair & Hinton, 2010) and the optimizer is Adam (Chilimbi et al., 2014). We train the policy with 16 (synchronous) parallel agents. The learning rate is 10^-3. The entropy factor (Mnih et al., 2016) is decayed linearly from 1 to 0.001 over 10,000 training iterations. For the meta-baseline, the meta learning rate is 10^-3 and the model specialization has five step updates, each with learning rate 10^-4. The model specialization step in MAML is performed with vanilla stochastic gradient descent." (These hyperparameters are collected into a configuration sketch after the table.)
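
The Pseudocode row cites Algorithm 1, which the table cannot show. Below is a minimal, first-order sketch of training a meta input-dependent baseline, written in PyTorch for brevity (the paper builds on OpenAI Baselines, so this is not the authors' implementation). The hidden sizes and learning rates follow the Experiment Setup row; class and function names and the toy data at the bottom are illustrative assumptions. In the full method, the specialized baseline would also supply the advantages for the policy-gradient update, and the outer loss would be computed on held-out rollouts from the same input sequence.

```python
# Hedged, first-order sketch of a meta input-dependent baseline (Algorithm 1 spirit).
# Not the authors' code: PyTorch, single-batch inner/outer data, first-order meta-gradient.
import copy
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Baseline V(s); hidden sizes 64 and 32 as quoted in the Experiment Setup row."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def adapt(meta_v, obs, returns, inner_lr=1e-4, steps=5):
    """Specialize a copy of the meta-baseline to one input sequence with plain SGD."""
    v = copy.deepcopy(meta_v)
    opt = torch.optim.SGD(v.parameters(), lr=inner_lr)
    for _ in range(steps):
        loss = ((v(obs) - returns) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return v

def meta_update(meta_v, meta_opt, batches):
    """Outer update: average regression loss of the specialized baselines,
    with first-order gradients accumulated onto the shared meta-parameters."""
    meta_opt.zero_grad()
    for obs, returns in batches:  # one batch per input sequence
        v = adapt(meta_v, obs, returns)
        loss = ((v(obs) - returns) ** 2).mean() / len(batches)
        grads = torch.autograd.grad(loss, tuple(v.parameters()))
        for p, g in zip(meta_v.parameters(), grads):
            p.grad = g.clone() if p.grad is None else p.grad + g
    meta_opt.step()

# Usage with random toy data standing in for per-input-sequence rollouts.
meta_v = ValueNet(obs_dim=8)
meta_opt = torch.optim.Adam(meta_v.parameters(), lr=1e-3)
batches = [(torch.randn(64, 8), torch.randn(64)) for _ in range(4)]
meta_update(meta_v, meta_opt, batches)
```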
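The Software Dependencies row names MuJoCo, OpenAI Gym, and OpenAI Baselines. The sketch below is only a sanity check of that stack, assuming the circa-2019 Gym API (reset returning an observation, step returning a 4-tuple) and a working mujoco-py installation; the environment ID "Walker2d-v2" is a standard Gym MuJoCo task chosen for illustration, not necessarily the paper's exact benchmark set.

```python
# Minimal sanity check of the MuJoCo + OpenAI Gym dependency stack cited above.
import gym

env = gym.make("Walker2d-v2")   # requires mujoco-py and a MuJoCo install
obs = env.reset()
for _ in range(10):
    action = env.action_space.sample()      # random policy, just to step the env
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```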
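The Experiment Setup row packs many hyperparameters into one quote. The sketch below collects them into a single configuration plus the linear entropy schedule they imply; the key names and the `entropy_weight` helper are my own and not from the paper's code.

```python
# Hyperparameters quoted in the Experiment Setup row, gathered into one place.
HYPERPARAMS = {
    "gamma": 0.995,                # discount factor, both environments
    "hidden_layers": (64, 32),     # actor and critic hidden sizes
    "activation": "relu",
    "optimizer": "adam",
    "parallel_agents": 16,         # synchronous workers
    "learning_rate": 1e-3,
    "entropy_start": 1.0,          # entropy factor, decayed linearly ...
    "entropy_end": 0.001,          # ... to 0.001 ...
    "entropy_decay_iters": 10_000, # ... over 10,000 training iterations
    "meta_learning_rate": 1e-3,    # meta-baseline outer loop
    "inner_steps": 5,              # specialization updates per input sequence
    "inner_learning_rate": 1e-4,   # vanilla SGD in the inner loop
}

def entropy_weight(iteration):
    """Linear decay of the entropy factor from 1.0 to 0.001 over 10,000 iterations."""
    frac = min(iteration / HYPERPARAMS["entropy_decay_iters"], 1.0)
    return (1.0 - frac) * HYPERPARAMS["entropy_start"] + frac * HYPERPARAMS["entropy_end"]
```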