Pre-training Tasks for Embedding-based Large-scale Retrieval
Authors: Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a comprehensive study on the embedding-based retrieval models. We show that the key ingredient of learning a strong embedding-based Transformer model is the set of pre-training tasks. With adequately designed paragraph-level pre-training tasks, the Transformer models can remarkably improve over the widely-used BM-25 as well as embedding models without Transformers. The paper includes a dedicated section '4 EXPERIMENTS' with detailed tables (Table 3, Table 4, etc.) presenting performance metrics and ablation studies on datasets like SQuAD and Natural Questions. |
| Researcher Affiliation | Collaboration | Wei-Cheng Chang, Felix X. Yu, Yin-Wen Chang, Yiming Yang, Sanjiv Kumar; Carnegie Mellon University & Google. {wchang2,yiming}@cs.cmu.edu, {felixyu,yinwen,sanjivk}@google.com |
| Pseudocode | No | The paper describes the proposed methods and tasks in detail but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement indicating that the source code for the methodology described is publicly released, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | The two QA datasets we consider are SQuAD and Natural Questions. Note that each entry of QA datasets is a tuple (q, a, p)... |
| Dataset Splits | Yes | For each dataset, we consider different training/test splits of the data (1%/99%, 5%/95%, and 80%/20%) in the fine-tuning stage, and 10% of the training set is held out as the validation set for hyper-parameter tuning. |
| Hardware Specification | Yes | We pre-train the model on 32 TPU v3 chips for 100K steps with an Adam optimizer and batch size of 8192. |
| Software Dependencies | No | The paper mentions using 'Adam optimizer' and 'Transformer' models, but does not specify any software dependencies with version numbers (e.g., Python version, specific library versions like PyTorch or TensorFlow versions). |
| Experiment Setup | Yes | For both towers, the final embedding is generated by applying a linear layer on the hidden state of the [CLS] token. The embedding dimension is 512. The sequence lengths for the query encoder and document encoder are set to 64 and 288, respectively. We pre-train the model on 32 TPU v3 chips for 100K steps with an Adam optimizer and batch size of 8192. This process takes about 2.5 days. We use the Adam optimizer with an initial learning rate of 1 × 10^-4 with a warm-up ratio of 0.1, followed by linear learning rate decay. For fine-tuning, the learning rate of Adam is set to 5 × 10^-5 with 2000 training steps and batch size 512. |
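To make the "Experiment Setup" row concrete, below is a minimal PyTorch sketch of a two-tower encoder in which each tower projects its [CLS] hidden state through a linear layer into a shared 512-dimensional embedding space, with query length 64 and document length 288 as reported above. The backbone depth, hidden size, head count, vocabulary size, the `TowerEncoder` class itself, and the choice of untied query/document towers are illustrative assumptions, not details confirmed by the paper.

```python
# Hedged sketch of the two-tower setup from the "Experiment Setup" row.
# Only the 512-d embedding, [CLS]-plus-linear pooling, 64/288 sequence
# lengths, and the 5e-5 fine-tuning learning rate come from the table;
# everything else (hidden=768, 12 layers/heads, vocab size, untied towers)
# is an assumption made for illustration.
import torch
import torch.nn as nn

class TowerEncoder(nn.Module):
    """One tower: a Transformer encoder whose [CLS] hidden state is
    projected by a linear layer into the shared 512-d embedding space."""
    def __init__(self, vocab_size=30522, hidden=768, layers=12,
                 heads=12, max_len=288, embed_dim=512):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.pos = nn.Embedding(max_len, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.proj = nn.Linear(hidden, embed_dim)  # linear layer on [CLS]

    def forward(self, ids):                       # ids: (batch, seq_len)
        pos = torch.arange(ids.size(1), device=ids.device)
        h = self.encoder(self.tok(ids) + self.pos(pos))
        return self.proj(h[:, 0])                 # [CLS] is the first token

# Query tower uses sequence length 64, document tower uses 288.
query_tower = TowerEncoder(max_len=64)
doc_tower = TowerEncoder(max_len=288)

# Relevance is scored by the inner product of query and document embeddings.
q = query_tower(torch.randint(0, 30522, (2, 64)))
d = doc_tower(torch.randint(0, 30522, (2, 288)))
scores = q @ d.t()                                # (2, 2) similarity matrix

# Fine-tuning optimizer per the table: Adam with learning rate 5e-5.
opt = torch.optim.Adam(list(query_tower.parameters())
                       + list(doc_tower.parameters()), lr=5e-5)
```

This sketch only illustrates the encoder and scoring geometry described in the table; the paper's pre-training tasks, batch construction, and TPU training loop are not reproduced here.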