Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

QuaRL: Quantization for Fast and Environmentally Sustainable Reinforcement Learning

Authors: Srivatsan Krishnan, Max Lam, Sharad Chitlangia, Zishen Wan, Gabriel Barth-Maron, Aleksandra Faust, Vijay Janapa Reddi

TMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our quantized distributed RL training system, ActorQ, demonstrates end-to-end speedups between 1.5× and 5.41×, and faster convergence over full-precision training on a range of tasks (DeepMind Control Suite) and different RL algorithms (D4PG, DQN). Furthermore, we compare the carbon emissions (kg of CO2) of ActorQ versus standard reinforcement learning algorithms on various tasks.
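The speedups reported above come from running actors with a quantized copy of the policy while the learner trains in full precision. As a rough illustration only (the exact quantization scheme and function names below are assumptions, not taken from the ActorQ implementation), symmetric uniform int8 weight quantization can be sketched as:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric uniform quantization of a float weight tensor to int8.

    Illustrative sketch; ActorQ's actual scheme may differ.
    """
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from int8 values and a scale."""
    return q.astype(np.float32) * scale

# Round-trip example: the reconstruction error is bounded by scale / 2.
w = np.linspace(-1.0, 1.0, 9)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
```

Actors then run inference with the low-precision weights, which is what yields the wall-clock speedup on CPU-bound actor processes.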
Researcher Affiliation Collaboration Srivatsan Krishnan EMAIL Harvard University Maximilian Lam EMAIL Harvard University Sharad Chitlangia EMAIL Amazon Advertising Zishen Wan EMAIL Georgia Institute of Technology Gabriel Barth-Maron EMAIL DeepMind Aleksandra Faust EMAIL Google Research, Brain Vijay Janapa Reddi EMAIL Harvard University
Pseudocode No No explicit pseudocode or algorithm blocks are present in the paper.
Open Source Code Yes The source code is available here for the public to use: https://github.com/harvard-edge/QuaRL.
Open Datasets Yes Our quantized distributed RL training system, ActorQ, demonstrates end-to-end speedups between 1.5× and 5.41×, and faster convergence over full-precision training on a range of tasks (DeepMind Control Suite) and different RL algorithms (D4PG, DQN). Furthermore, we compare the carbon emissions (kg of CO2) of ActorQ versus standard reinforcement learning algorithms on various tasks.
Dataset Splits No No specific dataset splits (e.g., train/validation/test percentages or sample counts for static datasets) are provided in the paper. The evaluation is based on rewards attained during interaction with environments over a specified number of training steps.
Hardware Specification Yes All the experiments are run in a distributed fashion to leverage multiple CPU cores and a GPU. A V100 GPU is used on the learner, while the actors are mapped to the CPU (1 core for each actor).
Software Dependencies No The paper mentions software components such as PyTorch and the ACME framework but does not provide specific version numbers for these or other libraries.
Experiment Setup Yes Policy architectures are fully connected networks with three hidden layers of size 2048. A Gaussian noise layer is applied to the output of the policy network on the actor to encourage exploration; sigma is assigned uniformly between 0 and 0.2 according to which actor is being executed. On the learner side, the critic is a fully connected network with three hidden layers of size 512. Policies are trained using D4PG (Barth-Maron et al., 2018) on continuous control environments and DQN (Mnih et al., 2013) on discrete control environments. An example submitted by an actor is sampled 16 times before being removed from the replay buffer (spi=16); lower spi is typically better as it minimizes model staleness (Fedus et al., 2020).
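The actor-side setup described in that row can be sketched as follows. This is a minimal numpy illustration under stated assumptions — the layer initialization, function names (`init_policy`, `policy_forward`, `actor_sigma`), and the exact per-actor sigma schedule are mine, not taken from the QuaRL code; only the layer sizes (three hidden layers of 2048), the ReLU nonlinearity, and the sigma range [0, 0.2] come from the paper's description:

```python
import numpy as np

def init_policy(obs_dim, act_dim, hidden=2048, seed=0):
    """Fully connected policy: three hidden layers of `hidden` units.

    He-style initialization is an assumption for the sketch.
    """
    rng = np.random.default_rng(seed)
    sizes = [obs_dim, hidden, hidden, hidden, act_dim]
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def policy_forward(params, obs, sigma=0.0, rng=None):
    """Forward pass with optional Gaussian exploration noise on the output."""
    rng = rng or np.random.default_rng()
    x = obs
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    if sigma > 0.0:  # noise is added only on actors, not on the learner
        x = x + sigma * rng.standard_normal(x.shape)
    return x

def actor_sigma(i, num_actors, max_sigma=0.2):
    """Assign actor i a sigma uniformly spaced in [0, max_sigma].

    One plausible reading of "uniformly assigned between 0 and 0.2
    according to the actor being executed"; the paper does not give
    the exact formula.
    """
    return max_sigma * i / max(num_actors - 1, 1)
```

With, say, 4 actors, this schedule gives sigmas of 0.0, 0.067, 0.133, and 0.2, so actors span a range of exploration intensities.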