Variance Reduction for Reinforcement Learning in Input-Driven Environments

Authors: Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh

ICLR 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our experimental results show that across environments from queuing systems, computer networks, and MuJoCo robotic locomotion, input-dependent baselines consistently improve training stability and result in better eventual policies."
Researcher Affiliation | Academia | "Hongzi Mao, Shaileshh Bojja Venkatakrishnan, Malte Schwarzkopf, Mohammad Alizadeh, MIT Computer Science and Artificial Intelligence Laboratory, {hongzi,bjjvnkt,malte,alizadeh}@csail.mit.edu"
Pseudocode | Yes | "Algorithm 1 Training a meta input-dependent baseline for policy-based methods." (A sketch of this training loop follows the table.)
Open Source Code | No | The paper references third-party open-source libraries (e.g., OpenAI Baselines) but does not provide a link or statement confirming that the source code for their specific methodology (input-dependent baselines and the meta-learning approach) is publicly available.
Open Datasets | Yes | "We simulate real-world video streaming using public cellular network data (Riiser et al., 2013)"
Dataset Splits | No | The paper mentions evaluating on "100 unseen testing input sequences" but does not specify explicit training, validation, and test dataset splits (e.g., percentages or exact counts) for the overall data partitioning.
Hardware Specification | No | The paper mentions using the MuJoCo physics engine and OpenAI Gym but does not provide specific hardware details such as GPU or CPU models, memory, or cloud instance specifications used for running the experiments.
Software Dependencies | Yes | "We use the MuJoCo physics engine (Todorov et al., 2012) in OpenAI Gym (Brockman et al., 2016)... OpenAI Baselines. https://github.com/openai/baselines, 2017... The activation function is ReLU (Nair & Hinton, 2010) and the optimizer is Adam (Chilimbi et al., 2014)." (A minimal check of this software stack follows the table.)
Experiment Setup | Yes | "We use γ = 0.995 for both environments. The actor and the critic networks have 2 hidden layers, with 64 and 32 hidden neurons on each. The activation function is ReLU (Nair & Hinton, 2010) and the optimizer is Adam (Chilimbi et al., 2014). We train the policy with 16 (synchronous) parallel agents. The learning rate is 10^-3. The entropy factor (Mnih et al., 2016) is decayed linearly from 1 to 0.001 over 10,000 training iterations. For the meta-baseline, the meta learning rate is 10^-3 and the model specialization has five step updates, each with learning rate 10^-4. The model specialization step in MAML is performed with vanilla stochastic gradient descent." (These hyperparameters are collected into a configuration sketch after the table.)
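
The Pseudocode row cites Algorithm 1, which the table cannot show. Below is a minimal, first-order sketch of training a meta input-dependent baseline, written in PyTorch for brevity (the paper builds on OpenAI Baselines, so this is not the authors' implementation). The hidden sizes and learning rates follow the Experiment Setup row; class and function names and the toy data at the bottom are illustrative assumptions. In the full method, the specialized baseline would also supply the advantages for the policy-gradient update, and the outer loss would be computed on held-out rollouts from the same input sequence.

```python
# Hedged, first-order sketch of a meta input-dependent baseline (Algorithm 1 spirit).
# Not the authors' code: PyTorch, single-batch inner/outer data, first-order meta-gradient.
import copy
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Baseline V(s); hidden sizes 64 and 32 as quoted in the Experiment Setup row."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 1))

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

def adapt(meta_v, obs, returns, inner_lr=1e-4, steps=5):
    """Specialize a copy of the meta-baseline to one input sequence with plain SGD."""
    v = copy.deepcopy(meta_v)
    opt = torch.optim.SGD(v.parameters(), lr=inner_lr)
    for _ in range(steps):
        loss = ((v(obs) - returns) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return v

def meta_update(meta_v, meta_opt, batches):
    """Outer update: average regression loss of the specialized baselines,
    with first-order gradients accumulated onto the shared meta-parameters."""
    meta_opt.zero_grad()
    for obs, returns in batches:  # one batch per input sequence
        v = adapt(meta_v, obs, returns)
        loss = ((v(obs) - returns) ** 2).mean() / len(batches)
        grads = torch.autograd.grad(loss, tuple(v.parameters()))
        for p, g in zip(meta_v.parameters(), grads):
            p.grad = g.clone() if p.grad is None else p.grad + g
    meta_opt.step()

# Usage with random toy data standing in for per-input-sequence rollouts.
meta_v = ValueNet(obs_dim=8)
meta_opt = torch.optim.Adam(meta_v.parameters(), lr=1e-3)
batches = [(torch.randn(64, 8), torch.randn(64)) for _ in range(4)]
meta_update(meta_v, meta_opt, batches)
```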
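The Software Dependencies row names MuJoCo, OpenAI Gym, and OpenAI Baselines. The sketch below is only a sanity check of that stack, assuming the circa-2019 Gym API (reset returning an observation, step returning a 4-tuple) and a working mujoco-py installation; the environment ID "Walker2d-v2" is a standard Gym MuJoCo task chosen for illustration, not necessarily the paper's exact benchmark set.

```python
# Minimal sanity check of the MuJoCo + OpenAI Gym dependency stack cited above.
import gym

env = gym.make("Walker2d-v2")   # requires mujoco-py and a MuJoCo install
obs = env.reset()
for _ in range(10):
    action = env.action_space.sample()      # random policy, just to step the env
    obs, reward, done, info = env.step(action)
    if done:
        obs = env.reset()
env.close()
```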
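The Experiment Setup row packs many hyperparameters into one quote. The sketch below collects them into a single configuration plus the linear entropy schedule they imply; the key names and the `entropy_weight` helper are my own and not from the paper's code.

```python
# Hyperparameters quoted in the Experiment Setup row, gathered into one place.
HYPERPARAMS = {
    "gamma": 0.995,                # discount factor, both environments
    "hidden_layers": (64, 32),     # actor and critic hidden sizes
    "activation": "relu",
    "optimizer": "adam",
    "parallel_agents": 16,         # synchronous workers
    "learning_rate": 1e-3,
    "entropy_start": 1.0,          # entropy factor, decayed linearly ...
    "entropy_end": 0.001,          # ... to 0.001 ...
    "entropy_decay_iters": 10_000, # ... over 10,000 training iterations
    "meta_learning_rate": 1e-3,    # meta-baseline outer loop
    "inner_steps": 5,              # specialization updates per input sequence
    "inner_learning_rate": 1e-4,   # vanilla SGD in the inner loop
}

def entropy_weight(iteration):
    """Linear decay of the entropy factor from 1.0 to 0.001 over 10,000 iterations."""
    frac = min(iteration / HYPERPARAMS["entropy_decay_iters"], 1.0)
    return (1.0 - frac) * HYPERPARAMS["entropy_start"] + frac * HYPERPARAMS["entropy_end"]
```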