Capturing Structural Locality in Non-parametric Language Models
Authors: Frank F. Xu, Junxian He, Graham Neubig, Vincent Josua Hellendoorn
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on two different domains, Java source code and Wikipedia text, demonstrate that locality features improve model efficacy over models without access to these features, with interesting differences. We also perform an analysis of how and where locality features contribute to improved performance and why the traditionally used contextual similarity metrics alone are not enough to grasp the locality structure. |
| Researcher Affiliation | Academia | School of Computer Science, Carnegie Mellon University {fangzhex,junxianh,gneubig}@cs.cmu.edu, vhellendoorn@cmu.edu |
| Pseudocode | No | No pseudocode or algorithm blocks are explicitly labeled or presented in a structured format. |
| Open Source Code | Yes | The source code package, containing a README document on how to reproduce the results and analysis as well as experiment scripts, is available in the paper's supplementary material. |
| Open Datasets | Yes | WIKITEXT-103 is a standard language modeling benchmark (Merity et al., 2016) consisting of natural language text from English Wikipedia. It contains a 250K token, word-level vocabulary, with 103M tokens in the training set and 250K tokens in both the validation and test sets. [...] https://blog.einstein.ai/the-wikitext-long-term-dependency-language-modeling-dataset/. JAVA GITHUB is a programming language corpus containing Java source code from Github (Allamanis & Sutton, 2013) that is widely used in source code modeling (Hellendoorn & Devanbu, 2017; Karampatsis et al., 2020). [...] https://zenodo.org/record/3628665. |
| Dataset Splits | Yes | WIKITEXT-103 [...] with 103M tokens in the training set and 250K tokens in both the validation and test sets. [...] JAVA GITHUB [...] It contains 1.44B tokens from 13,362 projects in the training split, 3.83M tokens from 36 projects in the validation split and 5.33M tokens from 38 projects in the test split. The splits are separated by whole projects. |
| Hardware Specification | Yes | All experiments are conducted on a single machine with a 48-core CPU and 8 NVIDIA V100 32GB GPUs. |
| Software Dependencies | No | The paper mentions using a pre-trained model ('For WIKITEXT-103 we use the pretrained model provided by (Khandelwal et al., 2020)'), but does not specify software dependencies with version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | In our experiments, we follow Khandelwal et al. (2020) in setting the interpolation factor λ to 0.25. [...] To optimize the parameters, we use the Adam (Kingma & Ba, 2014) optimizer with a learning rate of 0.0001 on the validation set for 200 epochs. [...] train an LM with the exact architecture and optimization described by Baevski & Auli (2018): a decoder-only Transformer (Vaswani et al., 2017), with 1024-dimensional hidden states for the WIKITEXT-103 dataset and 512 for JAVA GITHUB. (The kNN-LM interpolation is sketched below the table.) |
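
The Experiment Setup row refers to the standard kNN-LM interpolation of Khandelwal et al. (2020) with λ = 0.25 and to tuning parameters with Adam at a learning rate of 0.0001 on the validation set. Below is a minimal PyTorch sketch of that mixture and optimizer configuration, assuming the usual kNN-LM formulation; the function name, tensor shapes, and the `locality_weights` placeholder are illustrative, not taken from the paper's released code.

```python
import torch


def interpolate_knn_lm(p_lm: torch.Tensor,
                       p_knn: torch.Tensor,
                       lam: float = 0.25) -> torch.Tensor:
    """Mix the base LM and kNN next-token distributions (kNN-LM style).

    p_lm, p_knn: probability distributions over the vocabulary,
                 e.g. shape (batch, vocab_size), each summing to 1.
    lam:         interpolation factor; 0.25 following Khandelwal et al. (2020).
    """
    return lam * p_knn + (1.0 - lam) * p_lm


# Example: a uniform LM distribution mixed with a peaked kNN distribution.
p_lm = torch.full((1, 4), 0.25)
p_knn = torch.tensor([[0.7, 0.1, 0.1, 0.1]])
print(interpolate_knn_lm(p_lm, p_knn))  # tensor([[0.3625, 0.2125, 0.2125, 0.2125]])

# Sketch of the reported tuning setup: Adam with learning rate 1e-4.
# `locality_weights` is a hypothetical placeholder for the learnable
# locality-feature parameters; its shape here is arbitrary.
locality_weights = torch.nn.Parameter(torch.zeros(3))
optimizer = torch.optim.Adam([locality_weights], lr=1e-4)
```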