Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Prioritized Generative Replay
Authors: Renhao Wang, Kevin Frans, Pieter Abbeel, Sergey Levine, Alexei A. Efros
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on both state-based and pixel-based RL tasks show PGR is consistently more sample-efficient than both model-free RL algorithms and generative approaches that do not use any guidance. In fact, by densifying the more relevant transitions, PGR is able to succeed in cases where unconditional generation struggles significantly. Moreover, we empirically demonstrate that PGR goes beyond simple prioritized experience replay; in particular, we show that conditioning on curiosity leads to more diverse and more learning-relevant generations. |
| Researcher Affiliation | Academia | Renhao Wang, Kevin Frans, Pieter Abbeel, Sergey Levine, and Alexei A. Efros, Department of Electrical Engineering and Computer Science, University of California, Berkeley. PA holds concurrent appointments as a Professor at UC Berkeley and as an Amazon Scholar. This paper describes work performed at UC Berkeley and is not associated with Amazon. |
| Pseudocode | Yes | Algorithm 1 Overview of our outer loop + inner loop framework. |
| Open Source Code | No | Project page available at: https://pgenreplay.github.io. This is a project page, not a direct link to a source-code repository, and the paper does not contain an unambiguous statement of code release for the methodology described. |
| Open Datasets | Yes | Our results span a range of state-based and pixel-based tasks in the DeepMind Control Suite (DMC) (Tunyasuvunakool et al., 2020) and OpenAI Gym (Brockman et al., 2016) environments. |
| Dataset Splits | Yes | In all tasks, we allow 100K environment interactions, a standard choice in online RL (Li et al., 2023; Kostrikov et al., 2020). Our benchmark follows exactly the online RL evaluation suite of prior work in generative RL by Lu et al. (2024), facilitating direct comparison. |
| Hardware Specification | No | Generation VRAM (GB): 4.31 vs. 6.67 (+54.7%). Generation also fits easily on modern GPUs (<12GB). The paper mentions VRAM usage and general GPU types, but does not provide specific models or detailed specifications of the hardware used. |
| Software Dependencies | No | The paper refers to algorithms and models like SAC (Haarnoja et al., 2018), REDQ (Chen et al., 2021), DrQ-v2 (Yarats et al., 2021), and PPO (Schulman et al., 2017), but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | during training we randomly discard the scalar given by our relevance function F with probability 0.25. Implementation-wise, we keep both Dreal and Dsyn at 1M transitions, and randomly sample synthetic and real data mixed according to some ratio r to train our policy π. ...we increase the number of hidden layers from 2 to 3, and their widths from 256 to 512. ...we also increase batch size from 256 to 1024 to maintain per-parameter throughput. We now double the batch size to 512 and then to 1024, each time scaling r to 0.75 and 0.875, respectively. Finally, we double the UTD from 20 to 40, and the size of the synthetic data buffer Dsyn, from 1M to 2M transitions. Specifically, we parameterize the neural networks as three-layer CNNs, with bottleneck latent dimension 64 and feature output dimension 512. The CNNs are followed by a two-layer MLP projection, also of dimension 512. We resize the visual observations to 42 × 42 pixels, and use 8 context bins. We directly use the recommended hyperparameters in Savinov et al. (2018), setting α = 0.03, β = 0.5, M = 200, and F = percentile-90. The embedder network E is a ResNet-18 architecture with output dimension 512, followed by a four-layer MLP, also with feature and output dimensions of 512. |
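The "Pseudocode" row references the paper's Algorithm 1, an outer loop + inner loop framework. A minimal sketch of that structure is below; all of the callables (`collect`, `generate`, `relevance`, `update`) are illustrative placeholders standing in for environment rollout, relevance-conditioned generation, the relevance function F, and a policy gradient step, and are not the paper's actual API.

```python
import random

def pgr_train(collect, generate, relevance, update,
              n_outer=2, n_inner=3, batch_size=4, r=0.5):
    """Sketch of a PGR-style outer/inner training loop.

    collect()        -> list of real transitions from the environment
    generate(scores) -> list of synthetic transitions conditioned on
                        per-transition relevance scores
    relevance(t)     -> scalar relevance (e.g. curiosity) of transition t
    update(batch)    -> one policy/critic update on a mixed batch
    """
    d_real, d_syn = [], []
    for _ in range(n_outer):
        # Outer loop: gather real experience, then densify the most
        # learning-relevant transitions via conditional generation.
        d_real.extend(collect())
        d_syn.extend(generate([relevance(t) for t in d_real]))
        for _ in range(n_inner):
            # Inner loop: train on batches mixing synthetic and real
            # data according to ratio r (fraction drawn from d_syn).
            n_syn = int(batch_size * r)
            batch = (random.sample(d_syn, min(n_syn, len(d_syn))) +
                     random.sample(d_real, batch_size - n_syn))
            update(batch)
    return d_real, d_syn
```

The mixing ratio `r` here corresponds to the paper's batch-composition ratio, which the setup row reports scaling to 0.75 and then 0.875 as batch size doubles.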