Does Deep Learning Learn to Abstract? A Systematic Probing Framework
Authors: Shengnan An, Zeqi Lin, Bei Chen, Qiang Fu, Nanning Zheng, Jian-Guang Lou
ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | A set of controlled experiments are conducted based on this framework, providing strong evidence that two probed pre-trained language models (PLMs), T5 and GPT2, have the abstraction capability. We also conduct in-depth analysis, thus shedding further light: (1) the whole training phase exhibits a "memorize-then-abstract" two-stage process; (2) the learned abstract concepts are gathered in a few middle-layer attention heads, rather than evenly distributed throughout the model; (3) the probed abstraction capabilities exhibit robustness against concept mutations, and are more robust to low-level/source-side mutations than high-level/target-side ones; (4) generic pre-training is critical to the emergence of abstraction capability, and PLMs exhibit better abstraction with larger model sizes and data scales. |
| Researcher Affiliation | Collaboration | Shengnan An, Zeqi Lin, Bei Chen, Qiang Fu, Nanning Zheng, Jian-Guang Lou; Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University; Microsoft Corporation. {an1006634493@stu, nnzheng@mail}.xjtu.edu.cn; {Zeqi.Lin, beichen, qifu, jlou}@microsoft.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code and data are publicly available at https://github.com/microsoft/ContextualSP/tree/master/abstraction_probing. |
| Open Datasets | Yes | Our FLT tasks are majorly derived from the synthetic semantic parsing task COGS (Kim & Linzen, 2020) and the Probabilistic Context-Free Grammar (PCFG) it used. Data in our fuzzy grammar probe is taken from Europarl v7 (Koehn, 2005), a large parallel corpus for machine translation. |
| Dataset Splits | Yes | Evaluation: We take an early-stopping strategy in our evaluation to avoid catastrophic forgetting. First, each checkpoint saved during fine-tuning is evaluated on the held-out dev set. We choose the first checkpoint that achieves the best dev score for testing. Detailed settings are listed in Appendix K. (See the checkpoint-selection sketch below the table.) |
| Hardware Specification | Yes | We majorly use Tesla-V100-16GB GPUs for training and evaluation, except for the experiments on T5-Large or GPT2-Large, which require Tesla-V100-32GB GPUs. |
| Software Dependencies | No | Our experiments are based on the Huggingface Transformer models (Wolf et al., 2020). The paper does not specify version numbers for these software dependencies. |
| Experiment Setup | Yes | For both (continue) pre-training and fine-tuning, we take Adam (Loshchilov & Hutter, 2018) with 1e-5 learning rate and 0.01 weight decay. Batch size is 8 and max training step is 100k. We generate 3 groups of new terminals, repeat the experiments on each group with 2 random seeds, and finally average 6 results. The early-stopping strategy is applied to avoid catastrophic forgetting. Detailed settings are listed in Appendix K. (See the configuration sketch below the table.) |
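
The Experiment Setup row reports the optimizer and schedule used for both continued pre-training and fine-tuning. The following is a minimal configuration sketch using Hugging Face Transformers, which the paper states its experiments are based on; the model name, output directory, and save interval are illustrative assumptions, not values taken from the paper.

```python
# Minimal sketch of a fine-tuning configuration matching the reported
# hyperparameters: AdamW-style optimizer, 1e-5 learning rate, 0.01 weight
# decay, batch size 8, 100k max training steps. Model name, output_dir,
# and save_steps are illustrative assumptions, not the paper's values.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer, TrainingArguments

model_name = "t5-base"  # one of the probed PLM families; GPT2 would use a causal-LM class
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

training_args = TrainingArguments(
    output_dir="abstraction_probing_ckpts",  # assumed path
    learning_rate=1e-5,                      # reported learning rate
    weight_decay=0.01,                       # reported weight decay
    per_device_train_batch_size=8,           # reported batch size
    max_steps=100_000,                       # reported max training steps
    save_steps=5_000,                        # assumed save interval (not stated)
)

# A Trainer would then be built with these arguments and the FLT train/dev
# splits; checkpoints are chosen with the early-stopping rule sketched below.
```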
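
The Dataset Splits row describes the early-stopping evaluation protocol: every checkpoint saved during fine-tuning is scored on the held-out dev set, and the first checkpoint reaching the best dev score is used for testing. The sketch below is a minimal rendering of that selection rule; `checkpoint_dirs` and `evaluate_on_dev` are hypothetical placeholders, not the paper's actual code.

```python
# Minimal sketch of the checkpoint-selection rule quoted in the table.
# `checkpoint_dirs` (checkpoints in save order) and `evaluate_on_dev`
# (dev-set scoring function) are hypothetical placeholders.
from typing import Callable, Sequence


def select_checkpoint(checkpoint_dirs: Sequence[str],
                      evaluate_on_dev: Callable[[str], float]) -> str:
    """Return the earliest checkpoint that attains the best dev score."""
    dev_scores = [evaluate_on_dev(ckpt) for ckpt in checkpoint_dirs]
    best_score = max(dev_scores)
    # Taking the *first* checkpoint that reaches the best score guards
    # against catastrophic forgetting in later checkpoints that merely tie it.
    return checkpoint_dirs[dev_scores.index(best_score)]
```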