Decoupled Context Processing for Context Augmented Language Modeling
Authors: Zonglin Li, Ruiqi Guo, Sanjiv Kumar
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimented with the same encoder-decoder context incorporation mechanism for both auto-regressive language modeling and open domain question answering. |
| Researcher Affiliation | Industry | Zonglin Li Google Research, New York lizonglin@google.com Ruiqi Guo Google Research, New York guorq@google.com Sanjiv Kumar Google Research, New York sanjivk@google.com |
| Pseudocode | No | The paper includes architectural diagrams (e.g., Figure 1) and describes procedures in text, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | In the ethics checklist, it states: 'We still need to clean up the code before it's ready.' |
| Open Datasets | Yes | For auto-regressive language modeling, we use English C4 [32] version 2.2.1, the same as Retro. We use the same question and context processed by [18], where the context is retrieved with DPR retriever [20]. |
| Dataset Splits | Yes | Database / split (# articles, # entries): C4 Train (364.6M articles, 3382M entries); C4 Val, unfiltered (0.3645M articles, 3.369M entries); C4 Val, filtered (2.868M); NQ-open Train (79k); NQ-open Dev (8.8k); NQ-open Test (3.6k); Wiki Database (21M). |
| Hardware Specification | Yes | For base and large models we used 64 TPUv3 chips, whereas 128 TPUv3 chips were used for training XL. Each model is trained on 64 TPUv3 chips. |
| Software Dependencies | No | The paper mentions various software components and libraries, such as 'mT5 [42]', 'NLTK [3]', 'sentencepiece [22]', 'T5X retrieval framework [28]', 'ScaNN [14]', and 'Adafactor optimizer [36]', but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We trained the Encoder-Decoder LM model for a total of 1,100,000 steps with a batch size of 512 and a default learning rate schedule of square-root decay. This corresponds to 10,000 warmup steps with a fixed learning rate of 0.01, followed by square-root decay for 990,000 steps. We jointly fine-tuned the encoder and decoder for 40,000 steps with 20 context passages for each input of the train split. We used a batch size of 64, a fixed learning rate of 10^-4 and the Adafactor optimizer [36]. (A sketch of this learning-rate schedule follows the table.) |
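The "square-root decay" quoted in the Experiment Setup row is consistent with the standard T5/Adafactor inverse square-root schedule, in which the rate is held at 1/sqrt(warmup_steps) (0.01 for 10,000 warmup steps) and then decays as 1/sqrt(step). The sketch below assumes that interpretation; the function name and the example step values are illustrative, not taken from the paper.

```python
import math

def inverse_sqrt_schedule(step: int, warmup_steps: int = 10_000) -> float:
    """Inverse square-root learning-rate schedule (assumed T5/Adafactor default).

    Holds the rate at 1/sqrt(warmup_steps) = 0.01 during the 10,000
    warmup steps, then decays it as 1/sqrt(step).
    """
    return 1.0 / math.sqrt(max(step, warmup_steps))

# Warmup region: fixed at 0.01, matching the quoted setup.
assert abs(inverse_sqrt_schedule(5_000) - 0.01) < 1e-12
# Late in pre-training the rate has decayed to roughly 1e-3.
print(f"{inverse_sqrt_schedule(1_000_000):.4f}")  # -> 0.0010
```

For the QA fine-tuning stage the quote instead specifies a fixed learning rate of 10^-4 with the Adafactor optimizer, so no decay schedule is needed there.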