Discovering Latent Concepts Learned in BERT

Authors: Fahim Dalvi, Abdul Rafae Khan, Firoj Alam, Nadir Durrani, Jia Xu, Hassan Sajjad

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our analysis reveals interesting findings such as: i) the model learns novel concepts (e.g. animal categories and demographic groups), which do not strictly adhere to any pre-defined categorization (e.g. POS, semantic tags), ii) several latent concepts are based on multiple properties which may include semantics, syntax, and morphology, iii) the lower layers in the model dominate in learning shallow lexical concepts while the higher layers learn semantic relations and iv) the discovered latent concepts highlight potential biases learned in the model."
Researcher Affiliation | Academia | Fahim Dalvi, Firoj Alam, Nadir Durrani, and Hassan Sajjad ({faimaduddin,fialam,ndurrani,hsajjad}@hbku.edu.qa): Qatar Computing Research Institute, HBKU Research Complex, Doha 5825, Qatar. Abdul Rafae Khan and Jia Xu ({akhan4,jxu70}@stevens.edu): School of Engineering and Science, Stevens Institute of Technology, Hoboken, NJ 07030, USA.
Pseudocode | Yes | Algorithm 1 (Appendix A) presents the clustering procedure.
Open Source Code | Yes | "Code and dataset: https://neurox.qcri.org/projects/bert-concept-net.html"
Open Datasets | Yes | "We address these problems by using a subset of a large dataset of News 2018 WMT (359M tokens)." Dataset URL: http://data.statmt.org/news-crawl/en/
Dataset Splits | Yes | "We trained the model using 90% of the concept clusters and evaluate its performance on the remaining 10% concept clusters (held-out set). ... We randomly selected an equal amount of negative class instances and split the data into 80% train, 10% development and 10% test sets. ... We used standard splits for training, development and test data for the 4 linguistic tasks (POS, SEM, Chunking and CCG super tagging). The splits to preprocess the data are available through the git repository released with Liu et al. (2019a). See Table 7 for statistics." (A sketch of such an 80/10/10 split appears after the table.)
Hardware Specification | No | The paper mentions using the "12-layered BERT-base-cased model" but does not specify any hardware details, such as the GPU or CPU models used for the experiments.
Software Dependencies | No | "We tokenize sentences using the Moses tokenizer and pass them through the standard pipeline of BERT as implemented in Hugging Face." The paper names the Moses tokenizer and Hugging Face but does not provide specific version numbers for these dependencies. (A sketch of this tokenization-plus-extraction pipeline appears after the table.)
Experiment Setup | Yes | "The number of clusters K is a hyperparameter. We empirically set K = 1000. ... We applied Ward's minimum variance criterion that minimizes the total within-cluster variance. The distance between two vector representations is calculated with the squared Euclidean distance. ... In order to achieve higher precision, we introduce a threshold t = 0.97 on the confidence of the predicted cluster id, assigning new tokens to particular clusters only when the confidence of the classifier is higher than the threshold." (A sketch of this clustering and thresholding setup appears after the table.)
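
The Dataset Splits row quotes an 80/10/10 train/development/test split over balanced positive and negative instances. Purely as illustration, and not the authors' released code, here is a minimal sketch of producing such a split with scikit-learn; the texts and labels below are hypothetical placeholders.

    from sklearn.model_selection import train_test_split

    # Hypothetical balanced binary data standing in for the paper's
    # positive/negative concept instances.
    texts = ["example %d" % i for i in range(100)]
    labels = [i % 2 for i in range(100)]

    # Take 80% for training, then split the remaining 20% evenly
    # into development and test sets.
    x_train, x_rest, y_train, y_rest = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels)
    x_dev, x_test, y_dev, y_test = train_test_split(
        x_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)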
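The Software Dependencies row quotes Moses tokenization followed by the standard Hugging Face BERT pipeline. A minimal sketch of that extraction path, assuming the sacremoses package for Moses tokenization and BERT-base-cased with hidden-state output; the encoding choices here are assumptions rather than the authors' exact pipeline.

    import torch
    from sacremoses import MosesTokenizer
    from transformers import BertModel, BertTokenizerFast

    moses = MosesTokenizer(lang="en")
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
    model = BertModel.from_pretrained("bert-base-cased",
                                      output_hidden_states=True)
    model.eval()

    sentence = "Latent concepts emerge in contextual representations ."
    words = moses.tokenize(sentence, escape=False)

    # Encode pre-tokenized words so subwords stay aligned with words.
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**enc)

    # For BERT-base, hidden_states holds 13 tensors: the embedding layer
    # plus one (1, seq_len, 768) tensor per transformer layer.
    per_layer_reps = outputs.hidden_states

A layer-wise analysis like the paper's would then cluster the vectors from each of the 12 layers separately.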
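Finally, the Experiment Setup row maps onto standard tooling: agglomerative clustering under Ward's minimum-variance criterion (defined over squared Euclidean distances), cut to K = 1000 flat clusters, plus a confidence threshold of t = 0.97 when assigning new tokens. The sketch below uses scipy and a logistic-regression classifier; the paper does not tie its setup to these particular libraries, and the classifier choice is an assumption.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.linear_model import LogisticRegression

    K = 1000          # number of clusters (the paper's setting)
    CONF_T = 0.97     # confidence threshold for new-token assignment

    # Placeholder vectors; in the paper these are contextual BERT
    # representations of individual word occurrences.
    X = np.random.rand(2000, 768)

    # Ward's minimum-variance linkage, which scipy computes from
    # (squared) Euclidean distances; cut the tree into K flat clusters.
    Z = linkage(X, method="ward")
    cluster_ids = fcluster(Z, t=K, criterion="maxclust")

    # Thresholded assignment of unseen tokens: fit a classifier from
    # representation to cluster id and accept only confident predictions.
    clf = LogisticRegression(max_iter=1000).fit(X, cluster_ids)

    def assign_cluster(new_reps):
        probs = clf.predict_proba(new_reps)
        conf = probs.max(axis=1)
        pred = clf.classes_[probs.argmax(axis=1)]
        # Fall back to -1 (unassigned) below the confidence threshold.
        return np.where(conf > CONF_T, pred, -1)

The maxclust criterion cuts the dendrogram at the level that yields at most K flat clusters, matching the paper's treatment of K as a hyperparameter.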