Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

QuaRL: Quantization for Fast and Environmentally Sustainable Reinforcement Learning

Authors: Srivatsan Krishnan, Max Lam, Sharad Chitlangia, Zishen Wan, Gabriel Barth-Maron, Aleksandra Faust, Vijay Janapa Reddi

TMLR 2022 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our quantized distributed RL training system, ActorQ, demonstrates end-to-end speedups between 1.5× and 5.41×, and faster convergence over full-precision training on a range of tasks (DeepMind Control Suite) and different RL algorithms (D4PG, DQN). Furthermore, we compare the carbon emissions (kg of CO2) of ActorQ versus standard reinforcement learning algorithms on various tasks.
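The speedups reported above come from running actors with a quantized copy of the policy while the learner trains in full precision. As a rough illustration only (the exact quantization scheme and function names below are assumptions, not taken from the ActorQ implementation), symmetric uniform int8 weight quantization can be sketched as:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric uniform quantization of a float weight tensor to int8.

    Illustrative sketch; ActorQ's actual scheme may differ.
    """
    max_abs = np.abs(w).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from int8 values and a scale."""
    return q.astype(np.float32) * scale

# Round-trip example: the reconstruction error is bounded by scale / 2.
w = np.linspace(-1.0, 1.0, 9)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
```

Actors then run inference with the low-precision weights, which is what yields the wall-clock speedup on CPU-bound actor processes.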
Researcher Affiliation Collaboration Srivatsan Krishnan EMAIL Harvard University Maximilian Lam EMAIL Harvard University Sharad Chitlangia EMAIL Amazon Advertising Zishen Wan EMAIL Georgia Institute of Technology Gabriel Barth-Maron EMAIL DeepMind Aleksandra Faust EMAIL Google Research, Brain Vijay Janapa Reddi EMAIL Harvard University
Pseudocode No No explicit pseudocode or algorithm blocks are present in the paper.
Open Source Code Yes The source code is available here for the public to use: https://github.com/harvard-edge/QuaRL.
Open Datasets Yes Our quantized distributed RL training system, ActorQ, demonstrates end-to-end speedups between 1.5× and 5.41×, and faster convergence over full-precision training on a range of tasks (DeepMind Control Suite) and different RL algorithms (D4PG, DQN). Furthermore, we compare the carbon emissions (kg of CO2) of ActorQ versus standard reinforcement learning algorithms on various tasks.
Dataset Splits No No specific dataset splits (e.g., train/validation/test percentages or sample counts for static datasets) are provided in the paper. The evaluation is based on rewards attained during interaction with environments over a specified number of training steps.
Hardware Specification Yes All the experiments are run in a distributed fashion to leverage multiple CPU cores and a GPU. A V100 GPU is used on the learner, while the actors are mapped to the CPU (1 core for each actor).
Software Dependencies No The paper mentions software components such as PyTorch and the ACME framework but does not provide specific version numbers for these or other libraries.
Experiment Setup Yes Policy architectures are fully connected networks with three hidden layers of size 2048. A Gaussian noise layer is applied to the output of the policy network on the actor to encourage exploration; sigma is assigned uniformly between 0 and 0.2 according to which actor is being executed. On the learner side, the critic is a fully connected network with three hidden layers of size 512. Policies are trained using D4PG (Barth-Maron et al., 2018) on continuous control environments and DQN (Mnih et al., 2013) on discrete control environments. An example submitted by an actor is sampled 16 times before being removed from the replay buffer (spi=16); lower spi is typically better as it minimizes model staleness (Fedus et al., 2020).
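The actor-side setup described in that row can be sketched as follows. This is a minimal numpy illustration under stated assumptions — the layer initialization, function names (`init_policy`, `policy_forward`, `actor_sigma`), and the exact per-actor sigma schedule are mine, not taken from the QuaRL code; only the layer sizes (three hidden layers of 2048), the ReLU nonlinearity, and the sigma range [0, 0.2] come from the paper's description:

```python
import numpy as np

def init_policy(obs_dim, act_dim, hidden=2048, seed=0):
    """Fully connected policy: three hidden layers of `hidden` units.

    He-style initialization is an assumption for the sketch.
    """
    rng = np.random.default_rng(seed)
    sizes = [obs_dim, hidden, hidden, hidden, act_dim]
    return [(rng.standard_normal((m, n)) * np.sqrt(2.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def policy_forward(params, obs, sigma=0.0, rng=None):
    """Forward pass with optional Gaussian exploration noise on the output."""
    rng = rng or np.random.default_rng()
    x = obs
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.maximum(x, 0.0)  # ReLU on hidden layers only
    if sigma > 0.0:  # noise is added only on actors, not on the learner
        x = x + sigma * rng.standard_normal(x.shape)
    return x

def actor_sigma(i, num_actors, max_sigma=0.2):
    """Assign actor i a sigma uniformly spaced in [0, max_sigma].

    One plausible reading of "uniformly assigned between 0 and 0.2
    according to the actor being executed"; the paper does not give
    the exact formula.
    """
    return max_sigma * i / max(num_actors - 1, 1)
```

With, say, 4 actors, this schedule gives sigmas of 0.0, 0.067, 0.133, and 0.2, so actors span a range of exploration intensities.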