Large Batch Experience Replay

Authors: Thibault Lahire, Matthieu Geist, Emmanuel Rachelson

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 4 empirically evaluates the proposed sampling schemes, in particular the LaBER algorithm. We discuss each separate aspect of sampling, their explanations and perspectives. The first column compares the LaBER agent (LaBER-mean with m = 4) with exact and approximate gradient norms, and DQN with conventional and large mini-batch size (B and 4B) on MinAtar games. The second column compares different values of m for the LaBER-mean agent. The third column compares the three flavors of LaBER with m = 4. Finally, the last column compares all the studied algorithms. The x-axis is the number of interaction steps in millions. The y-axis is the average sum of rewards gathered along each episode.
Researcher Affiliation | Collaboration | 1 ISAE-SUPAERO, Université de Toulouse, France; 2 Google Research, Brain Team.
Pseudocode | Yes |
Algorithm 1 (PER and GER)
Require: replay buffer, priority list, mini-batch size.
Priorities := TD errors (PER) or per-sample gradient norms ‖∇θ ℓ(qi, yi)‖2 (GER).
loop
  Sample a prioritized mini-batch.
  Compute per-sample gradients.
  Update priorities.
  Perform SGD update.
end loop
Algorithm 2 (LaBER with surrogate priorities)
Require: replay buffer, mini-batch size, large batch size.
loop
  Sample uniformly a large batch.
  Compute surrogate priorities (e.g. TD errors).
  Down-sample according to surrogate priorities.
  Compute per-sample gradients on the mini-batch.
  Perform SGD update on the mini-batch.
end loop
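As an illustration of the Algorithm 2 loop above, a minimal Python sketch of one down-sampling step (not taken from the authors' repository; the list-based buffer, the td_error_fn callable and the LaBER-mean style loss weights at the end are our own placeholder assumptions):

import numpy as np

def laber_minibatch(buffer, td_error_fn, batch_size=32, m=4, rng=np.random):
    # One iteration of Algorithm 2: uniform large batch, surrogate priorities,
    # then prioritized down-sampling to a mini-batch of size batch_size.
    large_batch_size = m * batch_size
    idx = rng.choice(len(buffer), size=large_batch_size, replace=False)
    large_batch = [buffer[i] for i in idx]
    # Surrogate priorities, e.g. absolute TD errors over the large batch.
    priorities = np.abs(td_error_fn(large_batch)) + 1e-8
    probs = priorities / priorities.sum()
    sub = rng.choice(large_batch_size, size=batch_size, p=probs)
    minibatch = [large_batch[i] for i in sub]
    # Assumed LaBER-mean style loss weights: priorities normalised by the
    # large-batch mean, so the weighted SGD update stays close to unbiased.
    weights = priorities.mean() / priorities[sub]
    return minibatch, weights

The ratio m between the large-batch size and the mini-batch size is the quantity varied in the experiments summarized above (m = 4 in the main comparisons).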
Open Source Code | Yes | To ease reproducibility, we make our code available at https://github.com/sureli/laber and recall all hyperparameters in Appendix E.
Open Datasets | Yes | For Atari games, we used the Dopamine (Castro et al., 2018) DQN and C51 implementations as baselines, upon which we implemented prioritization. We follow the procedures of Machado et al. (2018) to train agents in the ALE (Arcade Learning Environment; Bellemare et al., 2013). For continuous control tasks, our extensions have been implemented over the SAC and TD3 agents provided by Stable-Baselines3 (Raffin et al., 2019) and evaluated on PyBullet environments (Coumans & Bai, 2016-2019).
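For context, a minimal sketch of how the continuous-control baselines are typically instantiated with Stable-Baselines3 on a PyBullet task (the environment id and time-step budget are illustrative choices of ours; the paper's prioritized variants extend these agents rather than using them verbatim):

import gym
import pybullet_envs  # registers the Bullet control tasks (Coumans & Bai)
from stable_baselines3 import SAC

# Baseline SAC agent with Stable-Baselines3 default hyperparameters.
env = gym.make("HalfCheetahBulletEnv-v0")
model = SAC("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)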
Dataset Splits | No | The paper does not provide explicit details about train/validation/test splits beyond mentioning the datasets themselves. It refers to standard practices and existing implementations (Dopamine, Stable-Baselines3), but does not state the specific split percentages or sample counts used for validation or testing within the paper.
Hardware Specification | Yes | The results on Atari games were obtained with single-node computations. Each node contained 2 12-core Skylake Intel(R) Xeon(R) Gold 6126 2.6 GHz CPUs with 96 GB of RAM and 2 NVIDIA(R) Tesla(R) V100 32 GB GPUs (only one was used per experiment). The results on the MinAtar and PyBullet environments also used single nodes. Each of these nodes was composed of 2 12-core Skylake Intel(R) Xeon(R) Gold 6126 2.6 GHz CPUs with 96 GB of RAM (no GPU hardware).
Software Dependencies | No | The paper mentions several software components like Dopamine, Stable-Baselines3, RMSProp, Adam, and neural network building blocks (e.g., Conv, FC, ReLU, Softmax). However, it does not provide specific version numbers for these components, which is required for reproducible software dependency information. For example, it lists: "Optimizer RMSProp (lr: 0.00025, Smoothing constant: 0.95, Centered: True, Epsilon: 10^-5)" and "Optimizer Adam (lr: 0.001, Epsilon: 0.00001)" but not the versions of the optimizers themselves or the frameworks they are part of.
Experiment Setup | Yes | We emphasize that all baseline algorithms have been used with default hyperparameters. For each experiment we ran n independent simulations and reported the average and the standard deviation, where n = 3 for Atari games (Bellemare et al., 2013), n = 6 for MinAtar games (Young & Tian, 2019) and n = 9 for PyBullet environments (Coumans & Bai, 2016-2019).
Table 2. DQN parameters for Atari:
Discount factor (γ): 0.99
Mini-batch size: 32
Replay buffer size: 10^6
Target update period: 8000
Interaction period: 4
Random actions rate: 0.01 (with a linear decay of period 2.5×10^5 steps)
Q-network structure: Conv 8×8, 32 filters, stride 4; Conv 4×4, 64 filters, stride 2; Conv 3×3, 64 filters, stride 1; FC 512; FC n_A
Activations: ReLU (except for the output layer)
Optimizer: RMSProp (lr: 0.00025, Smoothing constant: 0.95, Centered: True, Epsilon: 10^-5)
Table 3. DQN parameters for MinAtar... Table 4. SAC parameters... Table 5. TD3 parameters...
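As a compact restatement of Table 2, a hypothetical plain-Python configuration is given below (key names are ours, not Dopamine's gin-config bindings):

# Mirror of Table 2 (DQN on Atari); key names are illustrative only.
DQN_ATARI = {
    "gamma": 0.99,                      # discount factor
    "minibatch_size": 32,
    "replay_buffer_size": 10**6,
    "target_update_period": 8000,
    "interaction_period": 4,
    "epsilon_final": 0.01,              # random-action rate after linear decay
    "epsilon_decay_steps": 250_000,     # 2.5e5-step linear decay
    "network": ["Conv 8x8, 32, stride 4", "Conv 4x4, 64, stride 2",
                "Conv 3x3, 64, stride 1", "FC 512", "FC |A|"],
    "activations": "ReLU (except output layer)",
    "optimizer": {"name": "RMSProp", "lr": 2.5e-4, "smoothing": 0.95,
                  "centered": True, "epsilon": 1e-5},
}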