Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
The Rediscovery Hypothesis: Language Models Need to Meet Linguistics
Authors: Vassilina Nikoulina, Maxat Tezekbayev, Nuradil Kozhakhmet, Madina Babazhanova, Matthias Gallé, Zhenisbek Assylbekov
JAIR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In the first place, we show that language models that are significantly compressed but perform well on their pretraining objectives retain good scores when probed for linguistic structures. This result supports the rediscovery hypothesis and leads to the second contribution of our paper: an information-theoretic framework that relates language modeling objectives with linguistic information. This framework also provides a metric to measure the impact of linguistic information on the word prediction task. We reinforce our analytical results with various experiments, both on synthetic and on real NLP tasks in English. |
| Researcher Affiliation | Collaboration | Vassilina Nikoulina EMAIL, NAVER LABS Europe, 6-8 chemin de Maupertuis, 38240 Meylan, France; Maxat Tezekbayev EMAIL, Nuradil Kozhakhmet EMAIL, Madina Babazhanova EMAIL, Nazarbayev University, 53 Kabanbay Batyr ave., Nur-Sultan Z05H0P9, Kazakhstan |
| Pseudocode | Yes | Algorithm 2.1 (Lottery ticket hypothesis): identifying winning tickets (Frankle & Carbin, 2019). 1. Randomly initialize a neural network f(x; θ0), θ0 ∈ R^n. 2. Train the network for j iterations, arriving at parameters θj. 3. Prune p% of the parameters in θj, creating a mask m ∈ {0, 1}^n. 4. Reset the remaining parameters to their values in θ0, creating the winning ticket f(x; m ⊙ θ0). 5. Repeat from step 2 if performing iterative pruning. 6. Train the winning ticket f(x; m ⊙ θ0) to convergence. |
| Open Source Code | Yes | Finally, the SGNS model is pretrained on the text8 data (English as well) (Mahoney, 2011) using our custom implementation (Assylbekov, 2020). [Reference] Assylbekov, Z. (2020). SGNS implementation in PyTorch. https://github.com/zh3nis/SGNS. Accessed: 2021-11-07. |
| Open Datasets | Yes | We pretrain CoVe on the English-German part of the IWSLT 2016 machine translation task (Cettolo et al., 2016)... RoBERTa is pretrained on the WikiText-103 dataset (English) (Merity et al., 2017)... Finally, the SGNS model is pretrained on the text8 data (English as well) (Mahoney, 2011)... The edge probing classifier is trained on the standard benchmark dataset OntoNotes 5.0 (Weischedel et al., 2013)... The structural probe is trained on the English UD (Silveira et al., 2014)... For word similarities we use the WordSim353 dataset (Finkelstein et al., 2002), while for word analogies we use the Google dataset (Mikolov et al., 2013a). |
| Dataset Splits | Yes | For training we use the text8 dataset (Mahoney, 2011), which is the first 100MB of the English Wikipedia dump on Mar. 3, 2006. For validation we use the next 10MB of the same dump. ...RoBERTa is pretrained on the WikiText-103 dataset (English) (Merity et al., 2017) using the fairseq toolkit (Ott et al., 2019b) with default training settings (Ott et al., 2019a). ...The edge probing classifier is trained on the standard benchmark dataset OntoNotes 5.0 (Weischedel et al., 2013) using the jiant toolkit (Pruksachatkun et al., 2020). |
| Hardware Specification | No | The following training settings were used for training on 2 GPUs for 100 epochs: architecture: embedding size 512, 6 layers, 4 heads, hidden size 1024; batch size: 4096 tokens (with gradient accumulation for 4 steps); dropout 0.3. The paper mentions '2 GPUs' but does not specify the model or type of GPUs used. |
| Software Dependencies | No | We pretrain CoVe on the English-German part of the IWSLT 2016 machine translation task (Cettolo et al., 2016) using the OpenNMT-py toolkit (Klein et al., 2017)... RoBERTa is pretrained on the WikiText-103 dataset (English) (Merity et al., 2017) using the fairseq toolkit (Ott et al., 2019b)... The edge probing classifier is trained on the standard benchmark dataset OntoNotes 5.0 (Weischedel et al., 2013) using the jiant toolkit (Pruksachatkun et al., 2020)... Stanza tagger (Qi et al., 2020)... We used the SRILM toolkit (Stolcke, 2002). The paper mentions several software toolkits (OpenNMT-py, fairseq, jiant, Stanza, SRILM) but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We use the default hyperparameters of OpenNMT-py: batch size 64, maximum training steps 100,000. We use the following early stopping criterion: we stop training the model if there is no improvement in 3 consecutive validation perplexity scores (validation is performed every 2,000 steps). ...architecture: embedding size 512, 6 layers, 4 heads, hidden size 1024; batch size: 4096 tokens (with gradient accumulation for 4 steps); dropout 0.3. ...Embedding size 200, 15 epochs to train, 5 negative samples per training example, batch size 1024, window size 5. ...We use Adam optimizer with its default settings. |
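The lottery-ticket procedure quoted in the Pseudocode row can be sketched in a few lines. The toy below is not the paper's code: it uses a plain linear model trained by gradient descent as a stand-in for a neural network, with magnitude pruning and reset-to-initialization exactly as in steps 1-6 of Algorithm 2.1.

```python
import numpy as np

def train(w, mask, X, y, steps=200, lr=0.1):
    """Gradient descent on a linear least-squares model y ~ X @ w,
    with pruned weights (mask == 0) held at zero throughout."""
    for _ in range(steps):
        grad = X.T @ (X @ (w * mask) - y) / len(y)
        w = w - lr * grad * mask
    return w * mask

def lottery_ticket(X, y, prune_frac=0.5, rounds=2):
    """Iterative magnitude pruning (Frankle & Carbin, 2019), toy version."""
    rng = np.random.default_rng(0)
    w0 = rng.normal(size=X.shape[1])              # 1. random init theta_0
    mask = np.ones_like(w0)
    for _ in range(rounds):                       # 5. repeat (iterative pruning)
        w = train(w0.copy(), mask, X, y)          # 2. train to theta_j
        k = int(prune_frac * mask.sum())          # 3. prune p% of surviving weights
        alive = np.flatnonzero(mask)
        drop = alive[np.argsort(np.abs(w[alive]))[:k]]
        mask[drop] = 0.0
        # 4. reset survivors to their values in theta_0 (done by re-training
        #    from w0.copy() above), yielding the winning ticket m * theta_0
    return train(w0.copy(), mask, X, y), mask     # 6. train the winning ticket
```

With two 50% pruning rounds on 8 weights, the returned mask keeps 2 weights; on noiseless data generated by 2 informative features, those are the ones that survive.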
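The early-stopping rule quoted in the Experiment Setup row (stop after 3 consecutive validation perplexity checks without improvement, validating every 2,000 steps) can be sketched as follows. The `validate` callable is a hypothetical stand-in for the real training loop, not the paper's code.

```python
def train_with_early_stopping(validate, max_steps=100_000, val_every=2_000, patience=3):
    """Stop once `patience` consecutive validations show no perplexity improvement.

    `validate(step)` is assumed to run training up to `step` and return the
    validation perplexity there (a stand-in for the actual training loop).
    Returns (step at which training stopped, best perplexity seen).
    """
    best_ppl = float("inf")
    bad_checks = 0
    for step in range(val_every, max_steps + 1, val_every):
        ppl = validate(step)
        if ppl < best_ppl:
            best_ppl, bad_checks = ppl, 0     # improvement: reset the counter
        else:
            bad_checks += 1
            if bad_checks >= patience:
                return step, best_ppl         # stopped early
    return max_steps, best_ppl
```

For example, if perplexity improves at steps 2,000 and 4,000 and then plateaus, the loop stops at step 10,000, after the third check without improvement.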