Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Ranking Policy Gradient
Authors: Kaixiang Lin, Jiayu Zhou
ICLR 2020 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments showing that when consolidating with the off-policy learning framework, RPG substantially reduces the sample complexity, comparing to the state-of-the-art. |
| Researcher Affiliation | Academia | Kaixiang Lin Department of Computer Science and Engineering Michigan State University East Lansing, MI 48824-4403, USA EMAIL Jiayu Zhou Department of Computer Science and Engineering Michigan State University East Lansing, MI 48824-4403, USA EMAIL |
| Pseudocode | Yes | Algorithm 1 Off-Policy Learning for Ranking Policy Gradient (RPG) |
| Open Source Code | Yes | Code is available at https://github.com/illidanlab/rpg. |
| Open Datasets | Yes | To evaluate the sample-efficiency of Ranking Policy Gradient (RPG), we focus on Atari 2600 games in Open AI gym Bellemare et al. (2013); Brockman et al. (2016) |
| Dataset Splits | No | The paper does not explicitly provide training, validation, and test dataset splits with percentages or sample counts. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments, such as GPU or CPU models. |
| Software Dependencies | No | The paper mentions "Dopamine framework" and "openai baselines" but does not specify their version numbers or other software dependencies with version numbers. |
| Experiment Setup | Yes | The network architecture is the same as the convolution neural network used in DQN Mnih et al. (2015). We update the RPG network every four timesteps with a minibatch of size 32. The replay ratio is equal to eight for all baselines and RPG (except for ACER we use the default setting in openai baselines Dhariwal et al. (2017) for better performance). ... Table 3: Hyperparameters of RPG network — Architecture: Conv(32-8 8-4) → Conv(64-4 4-2) → Conv(64-3 3-2) → FC(512); Learning rate: 0.0000625; Batch size: 32; Update period: 4; Margin in Eq (6): 1 |
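The experiment-setup row states that RPG reuses the DQN convolutional network of Mnih et al. (2015). As a sanity check on those reported hyperparameters, the sketch below computes the feature-map sizes through that conv stack with plain arithmetic. Assumptions not stated in the table: the standard 84×84 Atari input from the DQN preprocessing pipeline, and stride 1 for the third convolution (the Nature DQN value; the table's `Conv(64-3 3-2)` notation is ambiguous on this point).

```python
# Hedged sketch: output-shape arithmetic for the DQN-style conv stack
# reported for RPG: Conv(32, 8x8, s4) -> Conv(64, 4x4, s2) -> Conv(64, 3x3, s1) -> FC(512).
# Input size 84x84 is an assumption carried over from the DQN preprocessing,
# and stride 1 on the third conv follows Nature DQN rather than the garbled table entry.

def conv_out(size: int, kernel: int, stride: int) -> int:
    """Spatial output size of a valid (no-padding) convolution."""
    return (size - kernel) // stride + 1

def rpg_feature_dims(h: int = 84, w: int = 84):
    """Propagate spatial dims through the three conv layers; returns (h, w, channels)."""
    for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
        h, w = conv_out(h, kernel, stride), conv_out(w, kernel, stride)
    return h, w, 64  # 64 channels after the final conv

h, w, c = rpg_feature_dims()
print(h, w, c, h * w * c)  # 7 7 64 3136 -> the FC(512) layer sees a 3136-dim input
```

Under these assumptions the flattened feature vector entering `FC(512)` has 3136 dimensions, matching the standard DQN head; this is only a consistency check, not a claim about the authors' exact implementation.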