Parseval Regularization for Continual Reinforcement Learning
Authors: Wesley Chung, Lynn Cherif, Doina Precup, David Meger
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive ablations to identify the source of its benefits and investigate its effect on certain metrics associated with network trainability, including weight matrix rank, weight norms, and policy entropy. We empirically demonstrate that this addition facilitates learning on sequences of RL tasks, as seen in Fig. 1 and Fig. 4, in Meta World [62], CARL [9], and Gridworld environments. (A hedged sketch of such trainability metrics appears after this table.) |
| Researcher Affiliation | Academia | Wesley Chung, Lynn Cherif, David Meger, Doina Precup; Mila, McGill University; chungwes@mila.quebec |
| Pseudocode | No | The paper includes code snippets for environment sequence generation (Listing 1 and Listing 2) but no formal pseudocode or algorithm blocks describing the Parseval regularization method itself. (A hedged sketch of the regularizer appears after this table.) |
| Open Source Code | No | The NeurIPS checklist answers [No] to the question "Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?" No justification is given beyond the checklist's boilerplate guidelines. |
| Open Datasets | Yes | We run experiments in four sets of environments: The first is a navigation task in a 15-by-15 Gridworld... Then, we consider two environments from the CARL suite [9]: Lunar Lander and DMCQuadruped... As a final benchmark, we use environments from the Meta World suite [62]... |
| Dataset Splits | Yes | We produce 20 sequences of tasks, where each task corresponds to one Metaworld environment. We use a stratified sampling approach to ensure that each of the 19 environments appears the same number of times (up to a difference of one) across the sequences. Moreover, no sequence contains the same environment twice. These choices promote a diversity of task orderings and task choices. Overall, we obtain 20 sequences of 10 tasks each. We call this benchmark Metaworld20-10. Each agent is run on every sequence with 3 seeds each. (A hedged sketch of one way to generate such sequences appears after this table.) |
| Hardware Specification | Yes | We run the experiments on CPUs, given the small size of the networks, on a combination of Intel Gold 6148 Skylake (2.4 GHz), AMD Rome 7532 (2.40 GHz, 256M L3 cache), and AMD Rome 7502 (2.50 GHz, 128M L3 cache) CPUs as part of a cluster. |
| Software Dependencies | No | We use the RPO agent [50] (a variant of PPO) for continuous actions, or PPO for discrete actions, based on the implementation from CleanRL [28]. (No version numbers are provided for CleanRL or other software libraries.) |
| Experiment Setup | Yes | The experimental configuration is documented in Table 2: Hyperparameters for RPO and PPO. |
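Since the paper provides no pseudocode for the method itself, the following is a minimal sketch of an orthogonality penalty in the spirit of Parseval regularization, assuming a PyTorch agent built from `nn.Linear` layers. The function name `parseval_penalty` and the coefficient `beta` are hypothetical; the paper's exact formulation may differ.

```python
import torch
import torch.nn as nn


def parseval_penalty(model: nn.Module, beta: float = 1e-3):
    """Orthogonality penalty in the spirit of Parseval regularization.

    For each linear layer's weight W (shape: out x in), penalize the
    deviation of W @ W.T from the identity, nudging the rows toward an
    orthonormal (Parseval tight frame) configuration. `beta` is a
    hypothetical coefficient, not a value taken from the paper.
    """
    penalty = 0.0
    for module in model.modules():
        if isinstance(module, nn.Linear):
            W = module.weight
            eye = torch.eye(W.shape[0], device=W.device, dtype=W.dtype)
            penalty = penalty + ((W @ W.T - eye) ** 2).sum()
    return beta * penalty


# Hypothetical usage inside a PPO/RPO update step:
# loss = policy_loss + value_coef * value_loss + parseval_penalty(agent)
```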
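The stratified task-sequence construction described in the Dataset Splits row (20 sequences of 10 tasks drawn from 19 Metaworld environments, no repeats within a sequence, roughly balanced environment usage across sequences) could be sketched as below. This is not the paper's Listing 1/2 code; `make_sequences` and its arguments are assumptions for illustration.

```python
import random


def make_sequences(env_names, n_sequences=20, seq_len=10, seed=0):
    """Greedy stratified sampling of task sequences.

    At each step, pick the least-used environment not already in the
    current sequence (ties broken randomly), so usage counts stay
    roughly balanced across sequences and no sequence repeats an
    environment. Sketch only; the paper's procedure may differ.
    """
    rng = random.Random(seed)
    counts = {name: 0 for name in env_names}
    sequences = []
    for _ in range(n_sequences):
        seq = []
        for _ in range(seq_len):
            candidates = [n for n in env_names if n not in seq]
            least = min(counts[n] for n in candidates)
            pick = rng.choice([n for n in candidates if counts[n] == least])
            counts[pick] += 1
            seq.append(pick)
        sequences.append(seq)
    return sequences


# Example: 19 placeholder environment names stand in for the Metaworld tasks.
sequences = make_sequences([f"env_{i}" for i in range(19)])
```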
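The trainability metrics mentioned in the Research Type row (weight matrix rank and weight norms) could be logged with a helper like the following. The paper's exact rank definition is not quoted here, so the tolerance `rank_tol` and the function name are assumptions; policy entropy would instead be read off the action distribution during the update (e.g. `dist.entropy().mean()`).

```python
import torch
import torch.nn as nn


@torch.no_grad()
def trainability_metrics(model: nn.Module, rank_tol: float = 1e-2):
    """Per-layer weight norm and approximate numerical rank.

    Counts singular values above `rank_tol` times the largest singular
    value as a proxy for weight matrix rank. Sketch of the kind of
    diagnostics tracked in the ablations, not the paper's exact code.
    """
    metrics = {}
    for name, module in model.named_modules():
        if isinstance(module, nn.Linear):
            W = module.weight
            svals = torch.linalg.svdvals(W)
            rank = int((svals > rank_tol * svals.max()).sum())
            metrics[name] = {"weight_norm": W.norm().item(), "rank": rank}
    return metrics
```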