Learning Not to Learn: Nature versus Nurture In Silico

Authors: Robert Tjarko Lange, Henning Sprekeler

AAAI 2022, pp. 7290-7299

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Using analytical and numerical analyses, we show that non-adaptive behaviors are optimal in two cases: when the optimal policy varies little across the tasks within the task ensemble, and when the time it takes to learn the optimal policy is too long to allow a sufficient exploitation of the learned policy. Our results suggest that not only the design of the meta-task distribution, but also the lifetime of the agent can have strong effects on the meta-learned algorithm of RNN-based agents. In particular, we find highly nonlinear and potentially discontinuous effects of ecological uncertainty, task complexity and lifetime on the optimal algorithm. As a consequence, a meta-learned adaptation strategy that was optimized, e.g., for a given lifetime may not generalize well to other lifetimes.
Researcher Affiliation | Academia | Berlin Institute of Technology, Excellence Cluster Science of Intelligence, Marchstr. 23, 10587 Berlin; {robert.t.lange,h.sprekeler}@tu-berlin.de
Pseudocode | No | The paper does not contain pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | All code will be available at GitHub: https://github.com/RobertTLange/learning-not-to-learn.
Open Datasets | No | The paper describes using a 'minimal two-arm Gaussian bandit task' and an 'ensemble of grid worlds' task, which are custom-defined simulation environments rather than standard publicly available datasets with concrete access information (link, DOI, etc.). A minimal sketch of such a bandit environment is given after the hyperparameter tables below.
Dataset Splits | No | The paper does not explicitly provide training/validation/test dataset splits with percentages or counts. It describes an experimental setup for training agents on custom environments.
Hardware Specification | No | The simulations were conducted on a CPU cluster and no GPUs were used. (No further details, such as CPU models, core counts, or memory, are specified.)
Software Dependencies | No | All simulations were implemented in Python using the PyTorch library (Paszke et al. 2017). Furthermore, all visualizations were done using Matplotlib (Hunter 2007) and Seaborn (Waskom 2021, BSD-3-Clause License). Finally, the numerical analysis was supported by NumPy (Harris et al. 2020, BSD-3-Clause License). Experiments were organized using the MLE-Infrastructure (Lange 2021, MIT license) training management system. (No version numbers are provided for any of the mentioned software.)
Experiment Setup | Yes | Tables 1 and 2 (reproduced below) give the hyperparameters of the bandit and gridworld A2C agents.

Table 1: Hyperparameters (architecture & training procedure) of the bandit A2C agent.
    Training episodes: 30k
    Learning rate: 0.001
    L2 weight decay λ: 3e-06
    Clipped gradient norm: 10
    Optimizer: Adam
    Workers: 2
    γ_T: 0.999
    β_e,T: 0.005
    β_v: 0.05
    γ_0: 0.4
    β_e,0: 1
    LSTM hidden units: 48
    γ anneal time: 27k episodes
    β_e anneal time: 30k episodes
    Learned hidden init.
    γ schedule: Exponential
    β_e schedule: Linear
    Forget gate bias init.: 1
    Orthogonal weight init.

Table 2: Hyperparameters (architecture & training procedure) of the gridworld A2C agent.
    Training episodes: 1M
    Learning rate: 0.001
    L2 weight decay λ: 0
    Clipped gradient norm: 10
    Optimizer: Adam
    Workers: 7
    γ: 0.99
    β_e,T: 0.01
    β_v: 0.1
    β_e schedule: Linear
    β_e,0: 0.5
    LSTM hidden units: 256
    β_e anneal time: 700k episodes
    Learned hidden init.
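Table 1 specifies a linearly annealed entropy coefficient (β_e,0 = 1 to β_e,T = 0.005 over 30k episodes) and an exponentially annealed discount factor (γ_0 = 0.4 to γ_T = 0.999 over 27k episodes). The snippet below is a minimal sketch of such schedules; the exact functional form of the exponential schedule is not given in this excerpt, so the geometric interpolation used here is an assumption.

    def linear_schedule(start, end, anneal_episodes, episode):
        # Linear interpolation from `start` to `end`, then held constant.
        frac = min(episode / anneal_episodes, 1.0)
        return start + frac * (end - start)

    def exponential_schedule(start, end, anneal_episodes, episode):
        # Geometric interpolation (one possible reading of "Exponential").
        frac = min(episode / anneal_episodes, 1.0)
        return start * (end / start) ** frac

    # Values from Table 1 (bandit A2C agent), evaluated mid-training.
    episode = 10_000
    beta_e = linear_schedule(1.0, 0.005, 30_000, episode)      # entropy coefficient
    gamma = exponential_schedule(0.4, 0.999, 27_000, episode)  # discount factor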
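Table 1 also lists architectural choices for the bandit agent: 48 LSTM hidden units, a learned hidden-state initialization, a forget gate bias initialized to 1, and orthogonal weight initialization. Below is a minimal PyTorch sketch of how these choices could be wired together; the class and attribute names are hypothetical and not taken from the authors' code.

    import torch
    import torch.nn as nn

    class RecurrentCore(nn.Module):
        # Hypothetical LSTM core mirroring the initialization choices in Table 1.
        def __init__(self, input_dim, hidden_dim=48):
            super().__init__()
            self.lstm = nn.LSTMCell(input_dim, hidden_dim)
            # Learned initial hidden and cell states (instead of zeros).
            self.h0 = nn.Parameter(torch.zeros(1, hidden_dim))
            self.c0 = nn.Parameter(torch.zeros(1, hidden_dim))
            for name, param in self.lstm.named_parameters():
                if "weight" in name:
                    nn.init.orthogonal_(param)  # orthogonal weight init
                elif name == "bias_ih":
                    nn.init.zeros_(param)
                    # PyTorch gate order is (input, forget, cell, output);
                    # set the forget-gate slice to 1.
                    param.data[hidden_dim:2 * hidden_dim] = 1.0
                else:  # bias_hh
                    nn.init.zeros_(param)

        def forward(self, x, state=None):
            if state is None:
                batch = x.size(0)
                state = (self.h0.expand(batch, -1), self.c0.expand(batch, -1))
            return self.lstm(x, state)

    # Usage: one recurrent step on a batch of 2 workers with 4 input features.
    core = RecurrentCore(input_dim=4)
    h, c = core(torch.randn(2, 4))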
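As noted under 'Open Datasets', the experiments use custom environments rather than fixed datasets. The following is a hypothetical minimal sketch of a two-arm Gaussian bandit task in the spirit of the paper's description; the class name, mean gap, and noise scale are assumptions, not values from the paper.

    import numpy as np

    class TwoArmGaussianBandit:
        # Hypothetical two-arm Gaussian bandit: one task = one pair of arm means.
        def __init__(self, mean_gap=1.0, reward_std=1.0, rng=None):
            self.rng = rng or np.random.default_rng()
            self.reward_std = reward_std
            # Randomly decide which arm pays more (defines the sampled task).
            better_arm = self.rng.integers(2)
            self.means = np.zeros(2)
            self.means[better_arm] = mean_gap

        def pull(self, arm):
            # Reward is Gaussian noise around the chosen arm's mean.
            return self.rng.normal(self.means[arm], self.reward_std)

    # One "lifetime": the agent interacts with a freshly sampled task for T steps.
    rng = np.random.default_rng(0)
    task = TwoArmGaussianBandit(rng=rng)
    rewards = [task.pull(arm=rng.integers(2)) for _ in range(100)]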