Neural Networks and the Chomsky Hierarchy
Authors: Grégoire Delétang, Anian Ruoss, Jordi Grau-Moya, Tim Genewein, Li Kevin Wenliang, Elliot Catt, Chris Cundy, Marcus Hutter, Shane Legg, Joel Veness, Pedro A. Ortega
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we conduct an extensive empirical study (20 910 models, 15 tasks) to investigate whether insights from the theory of computation can predict the limits of neural network generalization in practice. |
| Researcher Affiliation | Collaboration | *Equal contribution. Correspondence to {gdelt, anianr}@deepmind.com. ¹DeepMind. ²Stanford University. Work performed while the author was at DeepMind. |
| Pseudocode | Yes | Algorithm A.1: Training pipeline for our sequence prediction tasks. The comments (in blue) show an example output for the Reverse String (DCF) task (see the pipeline sketch below the table). |
| Open Source Code | Yes | We provide an open-source implementation of our models, tasks, and training and evaluation suite at https://github.com/deepmind/neural_networks_chomsky_hierarchy. |
| Open Datasets | Yes | We provide an open-source implementation of our models, tasks, and training and evaluation suite at https://github.com/deepmind/neural_networks_chomsky_hierarchy. Instead of using fixed-size datasets, we define training and test distributions from which we continually sample sequences. |
| Dataset Splits | No | The paper specifies training and testing distributions for sequence lengths (e.g., 'training range N, with N = 40' and 'For testing, we sample the sequence length ℓ from U(N + 1, M), with M = 500'), but it does not explicitly define a separate validation split or describe a validation methodology (the length-sampling sketch below the table illustrates this setup). |
| Hardware Specification | Yes | We ran each task-architecture-hyperparameter triplet on a single TPU on our internal cluster. |
| Software Dependencies | No | The paper mentions using JAX (Bradbury et al., 2018) and the DeepMind JAX ecosystem (Babuschkin et al., 2020; Hessel et al., 2020; Hennigan et al., 2020) but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2015) with default hyperparameters for 1 000 000 steps... We run all experiments with 10 different random seeds (used for network parameter initialization) and three learning rates (1 × 10⁻⁴, 3 × 10⁻⁴, and 5 × 10⁻⁴), and we report the result obtained by the hyperparameters with the maximum score (see the sweep sketch below the table). |
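
For orientation, here is a minimal sketch of the kind of training-pipeline step described in Algorithm A.1, using the Reverse String (DCF) task as the running example. The helper name `sample_reverse_string_batch` and the binary vocabulary are illustrative assumptions, not the paper's exact interface.

```python
import jax
import jax.numpy as jnp

def sample_reverse_string_batch(key, batch_size, length, vocab_size=2):
    """Sample inputs and targets for a Reverse String-style task: the target is
    the input token string reversed (illustrative helper, not the paper's API)."""
    inputs = jax.random.randint(key, (batch_size, length), 0, vocab_size)
    targets = jnp.flip(inputs, axis=-1)  # reverse along the sequence dimension
    return inputs, targets

# Example: one batch of 4 binary strings of length 8 (inside the training range).
key = jax.random.PRNGKey(0)
inputs, targets = sample_reverse_string_batch(key, batch_size=4, length=8)
```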
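The continual-sampling setup quoted in the Open Datasets and Dataset Splits rows can be sketched as follows, with the training range N = 40 and maximum test length M = 500 taken from the paper; the function names are hypothetical.

```python
import jax

N, M = 40, 500  # training range and maximum test length reported in the paper

def sample_train_length(key):
    # Training lengths are drawn uniformly from {1, ..., N}.
    return jax.random.randint(key, (), 1, N + 1)

def sample_test_length(key):
    # Test lengths are drawn uniformly from {N + 1, ..., M}, i.e. strictly
    # longer than any sequence seen during training (length generalization).
    return jax.random.randint(key, (), N + 1, M + 1)
```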
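Finally, a minimal sketch of the reported experiment setup and model-selection protocol, assuming Optax for the Adam optimizer; `train_and_evaluate` is a hypothetical stand-in for the actual training loop, which the paper does not expose under this name.

```python
import itertools
import optax

LEARNING_RATES = (1e-4, 3e-4, 5e-4)  # the three learning rates from the paper
NUM_SEEDS = 10                       # random seeds for parameter initialization
NUM_STEPS = 1_000_000                # Adam training steps

def make_optimizer(learning_rate):
    # Adam with otherwise-default hyperparameters (Kingma & Ba, 2015).
    return optax.adam(learning_rate)

def best_score(train_and_evaluate):
    """Return the maximum score over the seed x learning-rate grid, mirroring
    the paper's reporting protocol; `train_and_evaluate` is a caller-supplied
    training function (hypothetical interface)."""
    return max(
        train_and_evaluate(seed=seed, optimizer=make_optimizer(lr), num_steps=NUM_STEPS)
        for seed, lr in itertools.product(range(NUM_SEEDS), LEARNING_RATES)
    )
```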