Linking Emergent and Natural Languages via Corpus Transfer
Authors: Shunyu Yao, Mo Yu, Yang Zhang, Karthik R Narasimhan, Joshua B. Tenenbaum, Chuang Gan
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through a series of experiments, we find that corpus transfer is helpful when the downstream natural language resource is limited. For example, in a low-resource setup of modeling two million natural language tokens, such a transfer scheme reduces the test perplexity by 24.6% on average versus training from scratch, across ten different downstream languages. (A worked example of this relative perplexity reduction is sketched below the table.) |
| Researcher Affiliation | Collaboration | Shunyu Yao (Princeton University), Mo Yu (WeChat AI), Yang Zhang (MIT-IBM Watson AI Lab), Karthik Narasimhan (Princeton University), Joshua B. Tenenbaum (MIT), Chuang Gan (MIT-IBM Watson AI Lab) |
| Pseudocode | No | The paper describes methods and processes through text and equations (e.g., in Section 3.1) but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code at https://github.com/ysymyth/ec-nl and correspondence to shunyuy@princeton.edu. |
| Open Datasets | Yes | We scrape Wikipedia corpora of 10 languages to test downstream transfer... Implementation: We train the EC game and generate the emergent-communication (EC) corpus based on the Conceptual Captions dataset (Sharma et al., 2018), using more than 2.8 million natural images in the wild... Fine-tuning Data: We use the MS-COCO dataset (Lin et al., 2014) for fine-tuning... For downstream language modeling, we use ImageNet (Deng et al., 2009) to generate a corpus of 15 million tokens and fine-tune on Romanian (ro) and Hebrew (he). (A hedged data-loading sketch covering these corpora and the low-resource subsets appears below the table.) |
| Dataset Splits | Yes | We report the test perplexity at the best validation loss. ... We use the full training set, or a subset with 5,000 or 50,000 samples to study the transfer benefit when natural language annotation is limited. |
| Hardware Specification | Yes | Each game training takes less than 12 hours using one GeForce RTX 2080 GPU. ... A pre-training experiment can finish within one hour using one GeForce RTX 3090 GPU, while a fine-tuning or training-from-scratch experiment can finish within one hour using one GeForce RTX 2080 GPU. ... Pre-training on Conceptual Captions takes 8 GeForce RTX 3090 GPUs for around two days. |
| Software Dependencies | No | The paper mentions using 'Huggingface's transformers' (Wolf et al., 2019) and 'FAIRSEQ' (Ott et al., 2019) with references, but does not specify the exact version numbers of these or other software dependencies used for their own implementation. |
| Experiment Setup | Yes | Other architecture and training details mainly follow Li et al. (2020b), and by default V = 4035, T = 15, K = 256. For language modeling, we adopt a Transformer (Vaswani et al., 2017) with 6 decoder layers and 6 attention heads, and pre-train on each source corpus for 3,000 steps with batch size 32, input length 1,000, and learning rate 5 × 10⁻⁴. For fine-tuning and training from scratch on downstream corpora, the batch size is 8 and the learning rate is 10⁻⁴. (A configuration sketch with these hyperparameters appears below the table.) |
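
The 24.6% figure in the Research Type row is a relative reduction in test perplexity versus training from scratch. A minimal sketch of that metric, assuming the average is taken over per-language relative reductions and using made-up perplexity values purely for illustration:

```python
# Illustration of the "average relative perplexity reduction" metric.
# Assumption: the average is over per-language reductions
# 1 - ppl_transfer / ppl_scratch. The numbers below are invented and
# are NOT results from the paper.
ppl_scratch = {"ro": 120.0, "he": 150.0}   # training from scratch
ppl_transfer = {"ro": 90.0, "he": 115.0}   # after corpus transfer

reductions = [1 - ppl_transfer[lang] / ppl_scratch[lang] for lang in ppl_scratch]
print(f"average relative reduction: {sum(reductions) / len(reductions):.1%}")
# A 24.6% reduction means a from-scratch perplexity of 100 would drop to 75.4.
```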
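
The Open Datasets and Dataset Splits rows describe Wikipedia corpora for ten downstream languages and low-resource subsets of 5,000 or 50,000 samples. Below is a minimal sketch of assembling such data with the Hugging Face `datasets` library; the hub ID `wikimedia/wikipedia` and its dated config are assumptions about one convenient source, not the authors' own scraping pipeline.

```python
# Hypothetical data preparation for one downstream language (Romanian here).
# The dataset ID and config name are assumptions; the paper scraped its own
# Wikipedia corpora rather than using this hub dataset.
from datasets import load_dataset

wiki_ro = load_dataset("wikimedia/wikipedia", "20231101.ro", split="train")

# Mirror the low-resource study by keeping only a small subset of samples
# (the Dataset Splits row mentions 5,000- and 50,000-sample subsets).
wiki_ro_small = wiki_ro.select(range(5_000))
print(wiki_ro_small[0]["text"][:200])
```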
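
The Experiment Setup row lists the language-model architecture and optimization hyperparameters. Here is a minimal sketch of such a setup, assuming a GPT-2-style decoder-only model from Hugging Face `transformers`; the vocabulary size, embedding width, and choice of Adam are assumptions not stated in the row.

```python
# Decoder-only Transformer LM roughly matching the reported setup:
# 6 decoder layers, 6 attention heads, input length ~1,000 tokens,
# lr 5e-4 for pre-training and 1e-4 for fine-tuning / from scratch.
import torch
from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=30_000,   # assumption: depends on the tokenizer used
    n_positions=1024,    # covers the reported input length of 1,000 tokens
    n_layer=6,           # 6 decoder layers
    n_head=6,            # 6 attention heads
    n_embd=384,          # assumption: must be divisible by n_head
)
model = GPT2LMHeadModel(config)

# Pre-training on a source (emergent-language) corpus: 3,000 steps, batch 32.
pretrain_opt = torch.optim.Adam(model.parameters(), lr=5e-4)
# Fine-tuning or training from scratch on a downstream corpus: batch 8.
finetune_opt = torch.optim.Adam(model.parameters(), lr=1e-4)
```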