GLoMo: Unsupervised Learning of Transferable Relational Graphs
Authors: Zhilin Yang, Jake Zhao, Bhuwan Dhingra, Kaiming He, William W. Cohen, Russ R. Salakhutdinov, Yann LeCun
NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results show that GLoMo improves performance on various language tasks including question answering, natural language inference, and sentiment analysis. We also demonstrate that the learned graphs are generic enough to work with various sets of features on which the graphs have not been trained, including GloVe embeddings, ELMo embeddings, and task-specific RNN states. We also identify key factors of learning successful generic graphs: decoupling graphs and features, hierarchical graph representations, sparsity, unit-level objectives, and sequence prediction. To demonstrate the generality of our framework, we further show improved results on image classification by applying GLoMo to model the relational dependencies between the pixels. |
| Researcher Affiliation | Collaboration | 1Carnegie Mellon University, 2New York University, 3Facebook AI Research |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code or links to a code repository for the described methodology. |
| Open Datasets | Yes | Question Answering: The Stanford question answering dataset [31] (SQuAD) was recently proposed to advance machine reading comprehension. Natural Language Inference: We chose to use the latest Multi-Genre NLI corpus (MNLI) [46]. Sentiment Analysis: We use the movie review dataset collected in [22], with 25,000 training and 25,000 testing samples crawled from IMDB. Image Classification: We leverage the entire ImageNet [11] dataset and have the images resized to 32x32 [27]. In the transfer phase, we chose CIFAR-10 classification as our target task. |
| Dataset Splits | Yes | Table 3: CIFAR-10 classification results. We adopt a 42,000/8,000 train/validation split; once the best model is selected according to the validation error, we directly forward it to the test set without doing any validation-set place-back retraining. |
| Hardware Specification | No | The paper does not explicitly describe the specific hardware used for its experiments (e.g., CPU/GPU models, memory specifications). |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. |
| Experiment Setup | Yes | Here D is a hyper-parameter called the context length. In our implementation, at position t, in addition to predicting the forward context (x_{t+1}, ..., x_{t+D}), we also use a separate network to predict the backward context (x_{t-D}, ..., x_{t-1}), similar to [30]. We also adopt the multi-head attention [42] to produce multiple graphs per layer. We only used horizontal flipping for data augmentation. |
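The setup row above mentions that multi-head attention is used to produce multiple relational graphs per layer, with sparsity identified as a key factor. A minimal numpy sketch of this idea follows; the function and parameter names (`multi_head_graphs`, `Wk`, `Wq`) are hypothetical, and the squared-ReLU affinity followed by row normalization is an assumption about the sparse-attention form rather than a verified reproduction of the paper's exact equations.

```python
import numpy as np

def multi_head_graphs(x, Wk, Wq, bias=0.0, eps=1e-8):
    """Produce one sparse, row-normalized T x T affinity graph per head.

    x:  (T, d) token features.
    Wk: (H, d, d_head) per-head key projections (hypothetical shape).
    Wq: (H, d, d_head) per-head query projections.
    A squared ReLU of key-query products zeroes out many edges
    (sparsity); each row is then normalized to sum to at most 1.
    """
    H, T = Wk.shape[0], x.shape[0]
    graphs = np.zeros((H, T, T))
    for h in range(H):
        k = x @ Wk[h]                                   # (T, d_head) keys
        q = x @ Wq[h]                                   # (T, d_head) queries
        scores = np.maximum(k @ q.T + bias, 0.0) ** 2   # sparse affinities
        graphs[h] = scores / (scores.sum(axis=1, keepdims=True) + eps)
    return graphs
```

Because the graphs are decoupled from any particular feature set, the same (H, T, T) tensor could in principle be used to mix GloVe vectors, ELMo embeddings, or task-specific RNN states by matrix-multiplying the graph with whatever (T, d') feature matrix the downstream task provides.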