Does Deep Learning Learn to Abstract? A Systematic Probing Framework

Authors: Shengnan An, Zeqi Lin, Bei Chen, Qiang Fu, Nanning Zheng, Jian-Guang Lou

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | A set of controlled experiments are conducted based on this framework, providing strong evidence that two probed pre-trained language models (PLMs), T5 and GPT2, have the abstraction capability. We also conduct in-depth analysis, thus shedding further light: (1) the whole training phase exhibits a "memorize-then-abstract" two-stage process; (2) the learned abstract concepts are gathered in a few middle-layer attention heads, rather than evenly distributed throughout the model; (3) the probed abstraction capabilities exhibit robustness against concept mutations, and are more robust to low-level/source-side mutations than high-level/target-side ones; (4) generic pre-training is critical to the emergence of abstraction capability, and PLMs exhibit better abstraction with larger model sizes and data scales.
Researcher Affiliation | Collaboration | Shengnan An, Zeqi Lin, Bei Chen, Qiang Fu, Nanning Zheng, Jian-Guang Lou; Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University; Microsoft Corporation; {an1006634493@stu, nnzheng@mail}.xjtu.edu.cn; {Zeqi.Lin, beichen, qifu, jlou}@microsoft.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Our code and data are publicly available at https://github.com/microsoft/ContextualSP/tree/master/abstraction_probing.
Open Datasets | Yes | Our FLT tasks are majorly derived from the synthetic semantic parsing task COGS (Kim & Linzen, 2020) and the Probabilistic Context-Free Grammar (PCFG) it used. Data in our fuzzy grammar probe is taken from Europarl v7 (Koehn, 2005), a large parallel corpus for machine translation.
Dataset Splits | Yes | We take an early-stopping strategy in our evaluation to avoid catastrophic forgetting. First, each checkpoint saved during fine-tuning is evaluated on the held-out dev set. We choose the first checkpoint that achieves the best dev score for testing. Detailed settings are listed in Appendix K. (A minimal sketch of this checkpoint-selection procedure appears below the table.)
Hardware Specification | Yes | We majorly use Tesla-V100-16GB GPUs for training and evaluation, except for the experiments on T5-Large or GPT2-Large, which require Tesla-V100-32GB GPUs.
Software Dependencies | No | Our experiments are based on the Huggingface Transformer models (Wolf et al., 2020). The paper does not specify version numbers for these software dependencies. (A hedged model-loading sketch appears below the table.)
Experiment Setup | Yes | For both (continued) pre-training and fine-tuning, we take Adam (Loshchilov & Hutter, 2018) with 1e-5 learning rate and 0.01 weight decay. Batch size is 8 and max training step is 100k. We generate 3 groups of new terminals, repeat the experiments on each group with 2 random seeds, and finally average 6 results. The early-stopping strategy is applied to avoid catastrophic forgetting. Detailed settings are listed in Appendix K. (A training-loop sketch with these settings appears below the table.)
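The "Dataset Splits" evidence describes checkpoint selection by early stopping: every checkpoint saved during fine-tuning is scored on the held-out dev set, and the first checkpoint reaching the best dev score is used for testing. A minimal sketch of that selection rule, with hypothetical function and variable names (this is not the authors' code):

```python
def select_checkpoint(checkpoints, dev_set, evaluate):
    """Pick the first checkpoint that achieves the best dev score (hypothetical helper)."""
    best_score, best_ckpt = float("-inf"), None
    for ckpt in checkpoints:              # iterate in the order checkpoints were saved
        score = evaluate(ckpt, dev_set)   # e.g. accuracy on the held-out dev set
        if score > best_score:            # strict '>' keeps the *first* best checkpoint
            best_score, best_ckpt = score, ckpt
    return best_ckpt                      # this checkpoint is then evaluated on the test set
```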
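Since the paper probes T5 and GPT2 through Huggingface Transformers but pins no versions, the exact environment has to be reconstructed. The sketch below assumes a reasonably recent transformers release and the standard t5-base / gpt2 checkpoints; the specific model sizes used for each experiment are an assumption here, not a statement from the paper.

```python
# Assumes transformers >= 4.x; the paper does not state the version it used.
from transformers import AutoTokenizer, T5ForConditionalGeneration, GPT2LMHeadModel

t5_tok = AutoTokenizer.from_pretrained("t5-base")
t5 = T5ForConditionalGeneration.from_pretrained("t5-base")

gpt2_tok = AutoTokenizer.from_pretrained("gpt2")
gpt2 = GPT2LMHeadModel.from_pretrained("gpt2")
```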
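The "Experiment Setup" row reports the optimization settings; the cited Loshchilov & Hutter (2018) optimizer corresponds to AdamW, with learning rate 1e-5, weight decay 0.01, batch size 8, and at most 100k training steps. A minimal PyTorch sketch under those settings follows; the loop, dataset, and model arguments are placeholders, not the authors' implementation.

```python
import torch
from torch.utils.data import DataLoader, Dataset

def finetune(model: torch.nn.Module, train_dataset: Dataset, max_steps: int = 100_000):
    """Hypothetical loop using the reported hyperparameters: AdamW, lr 1e-5, wd 0.01, batch size 8."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5, weight_decay=0.01)
    loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
    step = 0
    while step < max_steps:
        for batch in loader:              # batches assumed to be dicts of tensors (input_ids, labels, ...)
            loss = model(**batch).loss    # Huggingface models return an output object with .loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            if step >= max_steps:
                return
```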