Think Big, Teach Small: Do Language Models Distil Occam’s Razor?
Authors: Gonzalo Jaimovitch-López, David Castellano Falcón, Cèsar Ferri, José Hernández-Orallo
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We frame this question as a teaching problem with strong priors, and study whether language models can identify simple algorithmic concepts from small witness sets. In particular, we explore how several GPT architectures, program induction systems and humans perform in terms of the complexity of the concept and the number of additional examples, and how much their behaviour differs. This first joint analysis of language models and machine teaching can address key questions for artificial intelligence and machine learning, such as whether some strong priors, and Occam's razor in particular, can be distilled from data, making learning from a few examples possible. |
| Researcher Affiliation | Academia | ¹VRAIN, Universitat Politècnica de València; ²Leverhulme Centre for the Future of Intelligence, University of Cambridge |
| Pseudocode | No | The paper describes a programming language (P3) and its instructions, but it does not include any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | All the code and experimental data can be found at https://github.com/gonzalojaimovitch/think-big-teach-small. |
| Open Datasets | No | The paper states, "We follow Telle et al. (2019) to generate the 'teaching book': all P3 concepts and the associated witness sets (assuming a simplicity-based learner)." This indicates the dataset was generated for the study; no link or citation to a pre-existing, publicly available dataset is provided. (A toy sketch of such a teaching-book construction follows the table.) |
| Dataset Splits | No | The paper describes a teaching protocol with three phases (WS, AS I, AS II) for incrementally providing examples, and uses "test examples" for evaluation. However, it does not specify traditional dataset splits (e.g., 80/10/10 percentages or counts) over a fixed dataset, as examples are generated and supplied incrementally. (A sketch of this three-phase protocol follows the table.) |
| Hardware Specification | No | The main text of the paper does not specify the hardware used for the experiments. It only refers to the supplementary material for this information. |
| Software Dependencies | No | While the paper names models such as GPT-2, GPT-3, MagicHaskeller, and Louise, it does not provide specific version numbers for any of these software components or for any other libraries used. |
| Experiment Setup | Yes | The default hyperparameters of the model are preserved, with the exception of the length of the output returned, which is modified to return 3 tokens. [...] Deterministic results can be obtained by setting the top-k parameter to 1, but we can get better performance if we use several continuations. Accordingly we will use the default top-k value, which is 0. (A hedged sketch of this decoding setup follows the table.) |
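To make the "teaching book" idea quoted under Open Datasets concrete, here is a minimal sketch of machine teaching with a simplicity-based learner. It substitutes a tiny hand-rolled hypothesis class of boolean functions for the paper's P3 language, and list position for program length; every name in it (`HYPOTHESES`, `simplest_consistent`, `minimal_witness_set`) is illustrative, not from the authors' code.

```python
# Toy stand-in for the teaching book of Telle et al. (2019): for each concept,
# find the smallest witness set from which a simplicity-prior learner
# recovers exactly that concept. The concept class here is illustrative.
from itertools import combinations, product

INPUTS = list(product([0, 1], repeat=2))
HYPOTHESES = [  # ordered from simplest to most complex (crude complexity proxy)
    ("always0", lambda x: 0),
    ("always1", lambda x: 1),
    ("first",   lambda x: x[0]),
    ("second",  lambda x: x[1]),
    ("and",     lambda x: x[0] & x[1]),
    ("or",      lambda x: x[0] | x[1]),
    ("xor",     lambda x: x[0] ^ x[1]),
]

def simplest_consistent(witness_set):
    """Simplicity-based learner: the first (simplest) hypothesis that
    labels every example in the witness set correctly."""
    for name, h in HYPOTHESES:
        if all(h(x) == y for x, y in witness_set):
            return name
    return None

def minimal_witness_set(target_name):
    """Teacher: the smallest set of (input, label) pairs from which the
    learner identifies the target concept."""
    target = dict(HYPOTHESES)[target_name]
    labelled = [(x, target(x)) for x in INPUTS]
    for size in range(len(labelled) + 1):
        for ws in combinations(labelled, size):
            if simplest_consistent(ws) == target_name:
                return list(ws)
    return None

# The "teaching book": every concept paired with its minimal witness set.
for name, _ in HYPOTHESES:
    print(f"{name}: {minimal_witness_set(name)}")
```

Running this prints, for each concept, the smallest example set that teaches it; the empty set suffices for the simplest concept, mirroring how a simplicity prior makes learning from very few examples possible.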
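Likewise, the three-phase protocol mentioned under Dataset Splits can be sketched as incremental prompt construction: phase WS presents only the witness set, and phases AS I and AS II each add further input/output examples before a held-out query. The `input -> output` format, the target concept, and the helper `make_prompt` are assumptions for illustration, not the paper's exact prompt templates.

```python
# Hedged sketch of the incremental teaching protocol (WS, AS I, AS II).
def make_prompt(examples, query_input):
    """Render (input, output) pairs few-shot style and append the query
    input, leaving its output for the model to complete."""
    lines = [f"{i} -> {o}" for i, o in examples]
    lines.append(f"{query_input} -> ")
    return "\n".join(lines)

# Target concept (illustrative): swap the case of a short string.
witness_set  = [("abc", "ABC")]   # phase WS: witness set only
additional_1 = [("XY", "xy")]     # phase AS I: one extra example
additional_2 = [("QrS", "qRs")]   # phase AS II: another extra example
test_input   = "mNo"              # held-out test example

phases = {
    "WS":    witness_set,
    "AS I":  witness_set + additional_1,
    "AS II": witness_set + additional_1 + additional_2,
}
for phase, examples in phases.items():
    print(f"--- {phase} ---")
    print(make_prompt(examples, test_input))
```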
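Finally, the decoding configuration quoted under Experiment Setup (3 output tokens; top-k of 1 for determinism versus a default of 0) can be approximated with the Hugging Face transformers API, used here as a plausible stand-in since the paper does not say which library drives its GPT-2 runs. In that API, `top_k=0` disables top-k filtering, i.e., sampling from the full distribution, matching the quoted default; the prompt string is illustrative.

```python
# Hedged sketch of the quoted decoding setup, assuming the Hugging Face
# transformers API (not confirmed by the paper).
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "abc -> ABC\nXY -> xy\nmNo -> "   # illustrative teaching prompt
inputs = tokenizer(prompt, return_tensors="pt")

# Deterministic variant: top_k=1 keeps only the single most likely token.
deterministic = model.generate(
    **inputs, max_new_tokens=3, do_sample=True, top_k=1,
    pad_token_id=tokenizer.eos_token_id,
)

# Default variant: top_k=0 applies no top-k cut-off, so several different
# continuations can be sampled and the best-scoring one kept.
sampled = model.generate(
    **inputs, max_new_tokens=3, do_sample=True, top_k=0,
    num_return_sequences=5,
    pad_token_id=tokenizer.eos_token_id,
)
prompt_len = inputs["input_ids"].shape[1]
for seq in sampled:
    print(tokenizer.decode(seq[prompt_len:]))
```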