GNN-LM: Language Modeling based on Global Contexts via GNN

Authors: Yuxian Meng, Shi Zong, Xiaoya Li, Xiaofei Sun, Tianwei Zhang, Fei Wu, Jiwei Li

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct comprehensive experiments to validate the effectiveness of GNN-LM: GNN-LM achieves a new state-of-the-art perplexity of 14.8 on WikiText-103 (a 3.9-point improvement over its vanilla LM counterpart), and shows substantial improvements on the One Billion Word and Enwik8 datasets against strong baselines. In-depth ablation studies are performed to understand the mechanics of GNN-LM.
Researcher Affiliation | Collaboration | Yuxian Meng1, Shi Zong2, Xiaoya Li1, Xiaofei Sun1,4, Tianwei Zhang3, Fei Wu4, Jiwei Li1,4 — 1Shannon.AI, 2Nanjing University, 3Nanyang Technological University, 4Zhejiang University. {yuxian_meng, xiaoya_li, xiaofei_sun, jiwei_li}@shannonai.com, szong@nju.edu.cn, tianwei.zhang@ntu.edu.sg, wufei@zju.edu.cn
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks. The methodology is described using text and mathematical formulations.
Open Source Code | Yes | The code can be found at https://github.com/ShannonAI/GNN-LM
Open Datasets | Yes | We conduct experiments on three widely-used language modeling datasets: WikiText-103 (Merity et al., 2016), One Billion Word (Chelba et al., 2013) and Enwik8 (Mahoney, 2011).
Dataset Splits | No | The paper states the total number of training tokens for WikiText-103 (103M) and reports 'test perplexity', but it does not explicitly provide the percentages or token counts for the training, validation, and test splits needed to reproduce the data partitioning. It describes context handling during training and evaluation, but not the overall dataset splits.
Hardware Specification | No | The paper mentions 'GPU memory usage' and a 'CPU machine with 64 cores' for data indexing, but it does not specify the exact models or types of GPUs or CPUs used for running the main experiments.
Software Dependencies | No | The paper mentions tools like FAISS but does not provide specific version numbers for any software dependencies, such as deep learning frameworks (e.g., PyTorch, TensorFlow), CUDA, or Python.
Experiment Setup | Yes | For all experiments, we add a 3-layer self-attention augmented GNN on top of the pretrained base LM, and use the same hidden dimension and number of heads as our base LM. We retrieve k = 1,024 nearest neighbors for each source token; among them, the top 128 neighbors are used in the graph, and all of them are used in computing the kNN-based probability p_kNN(w_t|c_t). For the neighbor context window sizes l and r in Section 2.2, we set l = 1 and r = 1.
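The retrieval step described above (fetch k = 1,024 neighbors, keep the top 128 for the graph, use all k for the kNN distribution) can be sketched as follows. This is a hypothetical illustration, not the authors' released code: the function name `knn_probability`, the brute-force L2 search (a stand-in for the FAISS index the paper mentions), and the flat softmax temperature are all assumptions made for the sake of a runnable toy example.

```python
import numpy as np

def knn_probability(query, datastore_keys, datastore_values, vocab_size,
                    k=1024, k_graph=128, temperature=1.0):
    """Toy sketch of a kNN-LM-style probability p_kNN(w_t | c_t).

    datastore_keys:   (N, d) array of context representations from the base LM
    datastore_values: (N,)   array of the next-token ids paired with each key
    Names and details are illustrative; GNN-LM uses FAISS for retrieval,
    which is replaced here by a brute-force L2 search.
    """
    # Brute-force nearest-neighbor search (stand-in for a FAISS index).
    dists = np.linalg.norm(datastore_keys - query, axis=1)
    nn = np.argsort(dists)[:k]

    # The closest k_graph neighbors would be used to build the graph
    # (graph construction itself is not shown in this sketch).
    graph_neighbors = nn[:k_graph]

    # All k neighbors contribute to the kNN distribution: softmax over
    # negative distances, aggregated per vocabulary item.
    logits = -dists[nn] / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    p_knn = np.zeros(vocab_size)
    np.add.at(p_knn, datastore_values[nn], weights)
    return p_knn, graph_neighbors

# Toy usage on random data.
rng = np.random.default_rng(0)
keys = rng.normal(size=(5000, 16)).astype(np.float32)
vals = rng.integers(0, 100, size=5000)
p, g = knn_probability(keys[0], keys, vals, vocab_size=100)
assert abs(p.sum() - 1.0) < 1e-6 and len(g) == 128
```

In the paper, this kNN distribution is interpolated with the GNN-augmented LM's output distribution; that interpolation and the graph construction over the top-128 neighbors (with context windows l = r = 1) are omitted here.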