Pre-training Tasks for Embedding-based Large-scale Retrieval

Authors: Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct a comprehensive study on the embedding-based retrieval models. We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks. With adequately designed paragraph-level pre-training tasks, the Transformer models can remarkably improve over the widely-used BM-25 as well as embedding models without Transformers. The paper includes a dedicated section '4 EXPERIMENTS' with detailed tables (Table 3, Table 4, etc.) presenting performance metrics and ablation studies on datasets such as SQuAD and Natural Questions.
Researcher Affiliation | Collaboration | Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar (Carnegie Mellon University & Google); {wchang2,yiming}@cs.cmu.edu, {felixyu,yinwen,sanjivk}@google.com
Pseudocode | No | The paper describes the proposed methods and tasks in detail but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain an explicit statement indicating that the source code for the methodology described is publicly released, nor does it provide a direct link to a code repository.
Open Datasets | Yes | The two QA datasets we consider are SQuAD and Natural Questions. Note that each entry of QA datasets is a tuple (q, a, p)...
Dataset Splits | Yes | For each dataset, we consider different training/test splits of the data (1%/99%, 5%/95%, and 80%/20%) in the fine-tuning stage, and 10% of the training set is held out as the validation set for hyper-parameter tuning. (See the split sketch below the table.)
Hardware Specification | Yes | We pre-train the model on 32 TPU v3 chips for 100K steps with an Adam optimizer and batch size of 8192.
Software Dependencies | No | The paper mentions using an Adam optimizer and Transformer models, but does not specify any software dependencies with version numbers (e.g., a Python version, or specific library versions such as PyTorch or TensorFlow).
Experiment Setup | Yes | For both towers, the final embedding is generated by applying a linear layer on the hidden state of the [CLS] token. The embedding dimension is 512. The sequence lengths for the query encoder and document encoder are set to 64 and 288, respectively. We pre-train the model on 32 TPU v3 chips for 100K steps with an Adam optimizer and batch size of 8192. This process takes about 2.5 days. We use the Adam optimizer with an initial learning rate of 1 × 10^-4 and a warm-up ratio of 0.1, followed by a linear learning rate decay. For fine-tuning, the learning rate of Adam is set to 5 × 10^-5 with 2000 training steps and batch size 512. (See the model and schedule sketches below the table.)
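
For concreteness, the splitting procedure quoted in the Dataset Splits row could look roughly like the following. This is a minimal sketch, assuming the data is a list of (query, answer, passage) tuples; the function name and the use of scikit-learn's train_test_split are illustrative choices, not the authors' code.

```python
# Hypothetical sketch of the train/test splits described above:
# split into train/test (1%/99%, 5%/95%, or 80%/20%), then hold out
# 10% of the training portion as a validation set.
from sklearn.model_selection import train_test_split

def make_split(examples, train_fraction, seed=0):
    """Return (train, valid, test) lists of (q, a, p) tuples."""
    train, test = train_test_split(
        examples, train_size=train_fraction, random_state=seed)
    train, valid = train_test_split(
        train, test_size=0.10, random_state=seed)
    return train, valid, test

# Example: the 5%/95% split used in the fine-tuning stage.
# train, valid, test = make_split(squad_examples, train_fraction=0.05)
```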
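The Experiment Setup row describes a two-tower architecture in which each tower projects the [CLS] hidden state through a linear layer to a 512-dimensional embedding, with maximum sequence lengths of 64 (query) and 288 (document). The sketch below captures that structure; the choice of PyTorch, the Hugging Face bert-base-uncased checkpoint, and the class names Tower and TwoTowerRetriever are assumptions for illustration (the authors train on TPUs and do not release code).

```python
# Illustrative two-tower retrieval model: each tower applies a linear
# layer to the [CLS] hidden state to produce a 512-dim embedding.
# PyTorch + Hugging Face BERT are assumptions made for this sketch.
import torch
import torch.nn as nn
from transformers import AutoModel

class Tower(nn.Module):
    def __init__(self, model_name="bert-base-uncased", embed_dim=512):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.proj = nn.Linear(self.encoder.config.hidden_size, embed_dim)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        cls = hidden[:, 0]        # hidden state of the [CLS] token
        return self.proj(cls)     # 512-dim embedding

class TwoTowerRetriever(nn.Module):
    """Query tower (max length 64) and document tower (max length 288);
    relevance is scored by the inner product of the two embeddings."""
    def __init__(self):
        super().__init__()
        self.query_tower = Tower()
        self.doc_tower = Tower()

    def forward(self, q_ids, q_mask, d_ids, d_mask):
        q_emb = self.query_tower(q_ids, q_mask)   # (batch, 512)
        d_emb = self.doc_tower(d_ids, d_mask)     # (batch, 512)
        # In-batch score matrix; the diagonal holds the positive pairs.
        return q_emb @ d_emb.t()
```

Scoring in-batch query/document pairs this way is one common setup for training such a model with a softmax cross-entropy loss over the score matrix, with positives on the diagonal; whether this matches the authors' exact loss is not verifiable from the quoted text alone.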
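The optimizer settings quoted above (Adam, initial learning rate 1 × 10^-4, warm-up ratio 0.1, linear decay over 100K pre-training steps) can be expressed with a standard warm-up-plus-linear-decay schedule. The snippet below is a sketch using Hugging Face's get_linear_schedule_with_warmup; the authors' actual training code and framework are not specified in the paper.

```python
# Optimizer / schedule sketch matching the quoted setup: Adam with
# initial LR 1e-4, 10% warm-up, then linear decay over 100K steps.
import torch
from transformers import get_linear_schedule_with_warmup

model = TwoTowerRetriever()                      # from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
total_steps = 100_000
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * total_steps),     # warm-up ratio 0.1
    num_training_steps=total_steps)

# Inside the training loop, call scheduler.step() after optimizer.step().
# For fine-tuning, the quoted settings are lr=5e-5, 2000 steps,
# and batch size 512.
```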