Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Reward-Aware Proto-Representations in Reinforcement Learning
Authors: Hon Tik Tse, Siddarth Chandrasekar, Marlos C. Machado
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we analyze the benefits of the DR in many of the settings in which the SR has been applied, including (1) reward shaping, (2) option discovery, (3) exploration, and (4) transfer learning. Our results show that, compared to the SR, the DR gives rise to qualitatively different, reward-aware behaviour and quantitatively better performance in several settings. 6 Experiments: Proto-representations, like the SR, have been used in RL for reward shaping [54, 49], option discovery [22, 25], count-based exploration [24], and transfer [2], among others. We now revisit these settings to assess the impact of using reward-aware representations. |
| Researcher Affiliation | Academia | Hon Tik Tse Siddarth Chandrasekar Marlos C. Machado* University of Alberta, Alberta Machine Intelligence Institute (Amii) *Canada CIFAR AI Chair EMAIL |
| Pseudocode | Yes | Algorithm 1 Reward-Aware Covering Eigenoptions (RACE) |
| Open Source Code | Yes | 5. Open access to data and code Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: We provide source code in the supplementary material. |
| Open Datasets | No | The paper refers to environments such as the grid task, four rooms, grid room, grid maze (Figure 1), Riverswim and Six Arms (Section 6.3). These are descriptions of simulated environments used for experiments, but the paper does not provide concrete access information (specific links, DOIs, repositories, or formal citations) for publicly available or open datasets that were explicitly used or released as part of this research. The data is generated dynamically within these environments. |
| Dataset Splits | No | The paper describes experimental setups with multiple independent runs (e.g., "50 runs," "20 seeds," "100 independent runs") and mentions sampling terminal state reward configurations for transfer learning. However, it conducts experiments in simulated environments where data is generated dynamically, and thus does not provide specific dataset split information (percentages, sample counts, or predefined splits) for a static dataset, as typically found in supervised learning. |
| Hardware Specification | No | E Compute Resources: We use CPUs for all of our experiments. We describe the runtimes for experiments involving the DR. For reward-shaping experiments, each independent run takes under 10 minutes. For eigenoption discovery experiments, each independent run takes less than 2 hours in grid task and four rooms, and takes around 10 hours in grid room and grid maze. For count-based exploration experiments, each independent run takes less than one minute. For transfer experiments, each independent run takes less than 5 minutes. Due to the large number of independent runs performed for hyperparameter search and preliminary experiments, we estimate the total compute used for the project to be 10.5 CPU core years. |
| Software Dependencies | No | To mitigate the resulting numerical issues, we use the library python-flint. Note, however, that the improved precision comes at the cost of increased runtime. |
| Experiment Setup | Yes | For the reward shaping approaches, we train a Q-learning [51] agent using a convex combination of the original environment reward, $r_t$, and the shaping reward, $\hat{r}_t$, resulting in the expression $(1 - \beta) r_t + \beta \hat{r}_t$, where $\beta \in [0, 1]$ is a hyperparameter. Note that we assume access to the eigenvectors of the SR and DR prior to training the agent with the potential-based reward. Future work can explore learning the eigenvectors and the policy simultaneously. For the no shaping baseline, we simply train the Q-learning agent using the original environment reward. We use γ = 0.99, ϵ = 0.05 for ϵ-greedy exploration, λ = 1.3 for the DR, and perform a grid search over the Q-learning's step size ([0.1, 0.3, 1.0]) and β ([0.25, 0.5, 0.75, 1.0]). We run 20 seeds for each hyperparameter setting, and after identifying the best hyperparameters, re-run 50 seeds to avoid maximization bias. |
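
The reward mixing and grid search quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the list-of-lists Q-table, and the assumption that the shaping reward is supplied by the caller are all hypothetical. Only the constants (γ = 0.99, ϵ = 0.05, the step-size grid, and the β grid) come from the quoted text.

```python
from itertools import product

# Values quoted in the Experiment Setup row.
GAMMA = 0.99
EPSILON = 0.05  # epsilon-greedy exploration rate (listed for completeness)
STEP_SIZES = [0.1, 0.3, 1.0]
BETAS = [0.25, 0.5, 0.75, 1.0]


def mixed_reward(r_t, r_hat_t, beta):
    """Convex combination (1 - beta) * r_t + beta * r_hat_t.

    r_hat_t is the shaping reward; how it is derived from the
    eigenvectors is not shown here and is assumed to be given.
    """
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta must lie in [0, 1]")
    return (1.0 - beta) * r_t + beta * r_hat_t


def q_learning_update(q, s, a, r, s_next, step_size, gamma=GAMMA, terminal=False):
    """One tabular Q-learning update on a hypothetical list-of-lists table."""
    target = r if terminal else r + gamma * max(q[s_next])
    q[s][a] += step_size * (target - q[s][a])


def hyperparameter_grid():
    """All (step_size, beta) pairs covered by the quoted grid search."""
    return list(product(STEP_SIZES, BETAS))
```

Under these assumptions the grid contains 12 settings; per the quoted text, each is run with 20 seeds and the best setting is then re-run with 50 fresh seeds.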