Discovering Latent Concepts Learned in BERT

Authors: Fahim Dalvi, Abdul Rafae Khan, Firoj Alam, Nadir Durrani, Jia Xu, Hassan Sajjad

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our analysis reveals interesting findings such as: i) the model learns novel concepts (e.g. animal categories and demographic groups), which do not strictly adhere to any pre-defined categorization (e.g. POS, semantic tags), ii) several latent concepts are based on multiple properties which may include semantics, syntax, and morphology, iii) the lower layers in the model dominate in learning shallow lexical concepts while the higher layers learn semantic relations and iv) the discovered latent concepts highlight potential biases learned in the model."
Researcher Affiliation | Academia | Fahim Dalvi, Firoj Alam, Nadir Durrani, and Hassan Sajjad ({faimaduddin,fialam,ndurrani,hsajjad}@hbku.edu.qa): Qatar Computing Research Institute, HBKU Research Complex, Doha 5825, Qatar. Abdul Rafae Khan and Jia Xu ({akhan4,jxu70}@stevens.edu): School of Engineering and Science, Stevens Institute of Technology, Hoboken, NJ 07030, USA.
Pseudocode | Yes | Algorithm 1 (Appendix A) presents the clustering procedure.
Open Source Code | Yes | "Code and dataset: https://neurox.qcri.org/projects/bert-concept-net.html"
Open Datasets | Yes | "We address these problems by using a subset of a large dataset of News 2018 WMT (359M tokens)." Dataset URL: http://data.statmt.org/news-crawl/en/
Dataset Splits | Yes | "We trained the model using 90% of the concept clusters and evaluate its performance on the remaining 10% concept clusters (held-out set). ... We randomly selected an equal amount of negative class instances and split the data into 80% train, 10% development and 10% test sets. ... We used standard splits for training, development and test data for the 4 linguistic tasks (POS, SEM, Chunking and CCG super tagging). The splits to preprocess the data are available through the git repository released with Liu et al. (2019a). See Table 7 for statistics." (A sketch of such an 80/10/10 split appears after the table.)
Hardware Specification | No | The paper mentions using the "12-layered BERT-base-cased model" but does not specify any hardware details, such as the GPU or CPU models used for the experiments.
Software Dependencies | No | "We tokenize sentences using the Moses tokenizer and pass them through the standard pipeline of BERT as implemented in Hugging Face." The paper names the Moses tokenizer and Hugging Face but does not provide specific version numbers for these dependencies. (A sketch of this tokenization-plus-extraction pipeline appears after the table.)
Experiment Setup | Yes | "The number of clusters K is a hyperparameter. We empirically set K = 1000. ... We applied Ward's minimum variance criterion that minimizes the total within-cluster variance. The distance between two vector representations is calculated with the squared Euclidean distance. ... In order to achieve higher precision, we introduce a threshold t = 0.97 on the confidence of the predicted cluster id, assigning new tokens to particular clusters only when the confidence of the classifier is higher than the threshold." (A sketch of this clustering and thresholding setup appears after the table.)
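
The Dataset Splits row quotes an 80/10/10 train/development/test split over balanced positive and negative instances. Purely as illustration, and not the authors' released code, here is a minimal sketch of producing such a split with scikit-learn; the texts and labels below are hypothetical placeholders.

    from sklearn.model_selection import train_test_split

    # Hypothetical balanced binary data standing in for the paper's
    # positive/negative concept instances.
    texts = ["example %d" % i for i in range(100)]
    labels = [i % 2 for i in range(100)]

    # Take 80% for training, then split the remaining 20% evenly
    # into development and test sets.
    x_train, x_rest, y_train, y_rest = train_test_split(
        texts, labels, test_size=0.2, random_state=42, stratify=labels)
    x_dev, x_test, y_dev, y_test = train_test_split(
        x_rest, y_rest, test_size=0.5, random_state=42, stratify=y_rest)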
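The Software Dependencies row quotes Moses tokenization followed by the standard Hugging Face BERT pipeline. A minimal sketch of that extraction path, assuming the sacremoses package for Moses tokenization and BERT-base-cased with hidden-state output; the encoding choices here are assumptions rather than the authors' exact pipeline.

    import torch
    from sacremoses import MosesTokenizer
    from transformers import BertModel, BertTokenizerFast

    moses = MosesTokenizer(lang="en")
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-cased")
    model = BertModel.from_pretrained("bert-base-cased",
                                      output_hidden_states=True)
    model.eval()

    sentence = "Latent concepts emerge in contextual representations ."
    words = moses.tokenize(sentence, escape=False)

    # Encode pre-tokenized words so subwords stay aligned with words.
    enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**enc)

    # For BERT-base, hidden_states holds 13 tensors: the embedding layer
    # plus one (1, seq_len, 768) tensor per transformer layer.
    per_layer_reps = outputs.hidden_states

A layer-wise analysis like the paper's would then cluster the vectors from each of the 12 layers separately.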
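Finally, the Experiment Setup row maps onto standard tooling: agglomerative clustering under Ward's minimum-variance criterion (defined over squared Euclidean distances), cut to K = 1000 flat clusters, plus a confidence threshold of t = 0.97 when assigning new tokens. The sketch below uses scipy and a logistic-regression classifier; the paper does not tie its setup to these particular libraries, and the classifier choice is an assumption.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from sklearn.linear_model import LogisticRegression

    K = 1000          # number of clusters (the paper's setting)
    CONF_T = 0.97     # confidence threshold for new-token assignment

    # Placeholder vectors; in the paper these are contextual BERT
    # representations of individual word occurrences.
    X = np.random.rand(2000, 768)

    # Ward's minimum-variance linkage, which scipy computes from
    # (squared) Euclidean distances; cut the tree into K flat clusters.
    Z = linkage(X, method="ward")
    cluster_ids = fcluster(Z, t=K, criterion="maxclust")

    # Thresholded assignment of unseen tokens: fit a classifier from
    # representation to cluster id and accept only confident predictions.
    clf = LogisticRegression(max_iter=1000).fit(X, cluster_ids)

    def assign_cluster(new_reps):
        probs = clf.predict_proba(new_reps)
        conf = probs.max(axis=1)
        pred = clf.classes_[probs.argmax(axis=1)]
        # Fall back to -1 (unassigned) below the confidence threshold.
        return np.where(conf > CONF_T, pred, -1)

The maxclust criterion cuts the dendrogram at the level that yields at most K flat clusters, matching the paper's treatment of K as a hyperparameter.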