Continual Learning In Environments With Polynomial Mixing Times
Authors: Matthew Riemer, Sharath Chandra Raparthy, Ignacio Cases, Gopeshh Subbaraj, Maximilian Puelma Touzel, Irina Rish
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 6 (Empirical Analysis of Mixing Behavior): We back up our theory with empirical analysis of mixing time scaling in continual RL settings based on the Atari and Mujoco benchmarks. |
| Researcher Affiliation | Collaboration | IBM Research; Mila, Université de Montréal; Massachusetts Institute of Technology |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/SharathRaparthy/polynomial_mixing_times.git. |
| Open Datasets | Yes | We perform experiments involving sequential interaction across 7 Atari environments: Breakout, Pong, Space Invaders, Beam Rider, Enduro, Sea Quest and Qbert. ... we also consider sequential Mujoco experiments with the following 5 environments: Half Cheetah, Hopper, Walker2d, Ant and Swimmer. For both our sequential Atari and Mujoco experiments, we leverage high-performing pretrained policies that are publicly available [21] for mixing time calculations. |
| Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, or testing. |
| Hardware Specification | No | The paper mentions 'IBM Cognitive Compute Cluster, the Mila cluster, and Compute Canada for providing computational resources' but does not specify exact GPU/CPU models, processor types, or memory amounts used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | The 60-dimensional sparse observation vector is processed by a multi-layer perceptron with two 100-unit hidden layers and ReLU activations. The agent performs optimization following episodic REINFORCE. We explore different configurations of this scalable MDP by varying \|Z\| from 1 to 1,000 and varying τ from 100 to 10,000. ... For each \|Z\| and τ, we train the agent for 10,000 total steps and report the reward rate after training, averaged across 300 seeds. |
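The Research Type row above cites the paper's empirical analysis of mixing time scaling. As a minimal illustration of the underlying quantity, the sketch below estimates the ε-mixing time of a small tabular Markov chain from its transition matrix, using the standard total-variation definition. The function name `epsilon_mixing_time` and its parameters are illustrative assumptions; this is not code from the paper's repository, which computes mixing times from publicly available pretrained policies on the full benchmarks.

```python
import numpy as np

def epsilon_mixing_time(P, eps=0.25, max_steps=100_000):
    """Smallest t such that the worst-case (over start states)
    total-variation distance between P^t(s, .) and the stationary
    distribution is <= eps. P is a row-stochastic transition matrix
    of an ergodic Markov chain. Hypothetical helper, standard definition."""
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    pi = pi / pi.sum()  # normalize (also fixes an overall sign flip)

    Pt = np.eye(P.shape[0])
    for t in range(1, max_steps + 1):
        Pt = Pt @ P  # t-step transition matrix P^t
        # TV distance per start state is half the L1 row difference.
        tv = 0.5 * np.abs(Pt - pi).sum(axis=1).max()
        if tv <= eps:
            return t
    return max_steps  # chain did not mix within the step budget
```

On a chain whose mixing time grows polynomially with the number of states, `max_steps` must scale accordingly; this scaling behavior is what the paper measures empirically.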
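The Experiment Setup row describes a 60-dimensional observation processed by an MLP with two 100-unit hidden layers and ReLU activations, optimized with episodic REINFORCE. Below is a minimal PyTorch sketch of that setup, assuming a Gymnasium-style environment API and a discrete action space; `PolicyMLP`, `reinforce_episode`, and the default `n_actions=4` are illustrative names and values, not taken from the paper.

```python
import torch
import torch.nn as nn

class PolicyMLP(nn.Module):
    """MLP policy matching the quoted description: 60-dim input,
    two 100-unit hidden layers, ReLU activations."""
    def __init__(self, obs_dim=60, hidden=100, n_actions=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_episode(policy, env, optimizer, gamma=0.99):
    """One episodic REINFORCE update: roll out a full episode, then
    ascend the return-weighted log-likelihood of the taken actions."""
    obs, _ = env.reset()
    log_probs, rewards = [], []
    done = False
    while not done:
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated
    # Discounted returns-to-go for each step of the episode.
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    loss = -(torch.stack(log_probs) * torch.as_tensor(returns)).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)
```

A training loop matching the quoted protocol would call `reinforce_episode` repeatedly until 10,000 environment steps are consumed, record the reward rate after training, and repeat across seeds.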