Continual Learning In Environments With Polynomial Mixing Times

Authors: Matthew Riemer, Sharath Chandra Raparthy, Ignacio Cases, Gopeshh Subbaraj, Maximilian Puelma Touzel, Irina Rish

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Section 6 (Empirical Analysis of Mixing Behavior): We back up our theory with empirical analysis of mixing time scaling in continual RL settings based on the Atari and Mujoco benchmarks. (See the mixing-time estimation sketch below the table.)
Researcher Affiliation | Collaboration | IBM Research; Mila, Université de Montréal; Massachusetts Institute of Technology
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/SharathRaparthy/polynomial_mixing_times.git.
Open Datasets | Yes | We perform experiments involving sequential interaction across 7 Atari environments: Breakout, Pong, Space Invaders, Beam Rider, Enduro, Sea Quest and Qbert. ... we also consider sequential Mujoco experiments with the following 5 environments: Half Cheetah, Hopper, Walker2d, Ant and Swimmer. For both our sequential Atari and Mujoco experiments, we leverage high-performing pretrained policies that are publicly available [21] for mixing time calculations. (See the sequential-environment sketch below the table.)
Dataset Splits | No | The paper does not provide specific dataset split information (exact percentages, sample counts, citations to predefined splits, or detailed splitting methodology) for training, validation, or testing.
Hardware Specification | No | The paper mentions the 'IBM Cognitive Compute Cluster, the Mila cluster, and Compute Canada for providing computational resources' but does not specify exact GPU/CPU models, processor types, or memory amounts used for running its experiments.
Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment.
Experiment Setup | Yes | The 60-dimensional sparse observation vector is processed by a multi-layer perceptron with two 100-unit hidden layers and ReLU activations. The agent performs optimization following episodic REINFORCE. We explore different configurations of this scalable MDP by varying |Z| from 1 to 1,000 and varying τ from 100 to 10,000. ... For each |Z| and τ, we train the agent for 10,000 total steps and report the average reward rate after training, averaged across 300 seeds. (See the policy and training-loop sketch below the table.)
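
The Research Type row above points to the paper's empirical analysis of mixing time scaling. As background only, the snippet below is a minimal sketch (not taken from the authors' repository) of how the ε-mixing time of a finite Markov chain can be computed when its transition matrix P is known: it finds the stationary distribution and then the smallest t at which every row of P^t is within total variation distance ε of it. The function names and the default ε = 1/4 are conventional choices, not details from the paper.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to a probability vector."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return pi / pi.sum()

def mixing_time(P, eps=0.25, max_t=100_000):
    """Smallest t with max_s TV(P^t(s, .), pi) <= eps, or max_t if the chain has not mixed."""
    pi = stationary_distribution(P)
    Pt = np.eye(P.shape[0])
    for t in range(1, max_t + 1):
        Pt = Pt @ P                                   # Pt now holds P^t
        tv = 0.5 * np.abs(Pt - pi).sum(axis=1).max()  # worst-case total variation distance
        if tv <= eps:
            return t
    return max_t
```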
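
The Open Datasets row lists sequential Atari and Mujoco experiments. The sketch below shows one plausible way to iterate over those environments with the classic (pre-0.26) Gym API; the specific environment IDs, the `sequential_rollout` helper, and the `policy_fn` interface are illustrative assumptions, not the authors' setup.

```python
import gym

# Environment IDs are assumptions about the exact Gym/ALE names and versions used.
ATARI_IDS = [
    "BreakoutNoFrameskip-v4", "PongNoFrameskip-v4", "SpaceInvadersNoFrameskip-v4",
    "BeamRiderNoFrameskip-v4", "EnduroNoFrameskip-v4", "SeaquestNoFrameskip-v4",
    "QbertNoFrameskip-v4",
]
MUJOCO_IDS = ["HalfCheetah-v3", "Hopper-v3", "Walker2d-v3", "Ant-v3", "Swimmer-v3"]

def sequential_rollout(env_ids, policy_fn, episodes_per_env=1):
    """Visit each environment in sequence and record its mean episode return."""
    mean_returns = {}
    for env_id in env_ids:
        env = gym.make(env_id)
        total = 0.0
        for _ in range(episodes_per_env):
            obs, done = env.reset(), False
            while not done:
                action = policy_fn(env_id, obs)        # e.g., a pretrained policy per task
                obs, reward, done, _ = env.step(action)
                total += reward
        mean_returns[env_id] = total / episodes_per_env
        env.close()
    return mean_returns
```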
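
The Experiment Setup row describes a two-hidden-layer (100-unit, ReLU) MLP trained with episodic REINFORCE on a 60-dimensional sparse observation. The sketch below implements that recipe under stated assumptions: a discrete action space of placeholder size 4, an Adam optimizer, undiscounted returns-to-go, and the classic Gym step/reset API. Beyond the network shape and the use of episodic REINFORCE, none of these choices are specified in the quoted text.

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """MLP policy: two 100-unit hidden layers with ReLU, as in the quoted setup."""
    def __init__(self, obs_dim=60, n_actions=4):  # n_actions=4 is a placeholder
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 100), nn.ReLU(),
            nn.Linear(100, 100), nn.ReLU(),
            nn.Linear(100, n_actions),
        )

    def forward(self, obs):
        return torch.distributions.Categorical(logits=self.net(obs))

def reinforce_episode(policy, optimizer, env, gamma=1.0):
    """Run one episode and apply a single episodic REINFORCE update."""
    obs, done = env.reset(), False
    log_probs, rewards = [], []
    while not done:
        dist = policy(torch.as_tensor(obs, dtype=torch.float32))
        action = dist.sample()
        obs, reward, done, _ = env.step(action.item())
        log_probs.append(dist.log_prob(action))
        rewards.append(float(reward))
    # Returns-to-go for each time step (undiscounted when gamma=1.0).
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
    loss = -(torch.stack(log_probs) * returns).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return sum(rewards)

# Hypothetical usage (the scalable-MDP environment itself is not shown here):
# policy = PolicyNet()
# optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
# for _ in range(num_episodes):
#     reinforce_episode(policy, optimizer, env)
```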