Decoupled Context Processing for Context Augmented Language Modeling

Authors: Zonglin Li, Ruiqi Guo, Sanjiv Kumar

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We experimented with the same encoder-decoder context incorporation mechanism for both auto-regressive language modeling and open domain question answering." (A minimal sketch of this decoupled-context pattern follows the table.)
Researcher Affiliation | Industry | Zonglin Li, Google Research, New York (lizonglin@google.com); Ruiqi Guo, Google Research, New York (guorq@google.com); Sanjiv Kumar, Google Research, New York (sanjivk@google.com)
Pseudocode | No | The paper includes architectural diagrams (e.g., Figure 1) and describes its procedures in text, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The ethics checklist states: "We still need to clean up the code before it's ready."
Open Datasets | Yes | "For auto-regressive language modeling, we use English C4 [32] version 2.2.1, the same as Retro." "We use the same question and context processed by [18], where the context is retrieved with DPR retriever [20]." (See the loading sketch below the table.)
Dataset Splits | Yes | (database / split: # articles, # entries) C4 / Train: 364.6M, 3382M; C4 / Val (unfiltered): 0.3645M, 3.369M; C4 / Val (filtered): 2.868M; NQ-open / Train: 79k; NQ-open / Dev: 8.8k; NQ-open / Test: 3.6k; Wiki / Database: 21M.
Hardware Specification | Yes | "For base and large we used 64 TPUv3 chips whereas 128 TPUv3 chips for training XL." "Each model is trained on 64 TPUv3 chips."
Software Dependencies | No | The paper mentions several software components and libraries, such as mT5 [42], NLTK [3], SentencePiece [22], the T5X retrieval framework [28], ScaNN [14], and the Adafactor optimizer [36], but does not provide version numbers for any of them. (A version-capture snippet follows the table.)
Experiment Setup | Yes | "We trained the Encoder-Decoder LM model for a total of 1,100,000 steps with a batch size of 512 and a default learning rate schedule of square-root decay. This corresponds to 10,000 warmup steps with a fixed learning rate of 0.01, followed by square-root decay for 990,000 steps." "We jointly fine-tuned the encoder and decoder for 40,000 steps with 20 context passages for each input of the train split. We used a batch size of 64, a fixed learning rate of 10^-4 and Adafactor optimizer [36]." (A schedule sketch follows the table.)
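The decoupled-context mechanism quoted under Research Type can be illustrated with a toy cross-attention example: retrieved passages are encoded once (and can be cached), and the decoder attends to the pre-computed encodings. The sketch below is a minimal NumPy illustration of that general pattern, not the authors' T5X implementation; the toy_encoder, the model width of 64, and the single-head attention without learned projections are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_context(passages, encoder):
    # Encode each retrieved passage independently and concatenate the results.
    # In a decoupled setup this step can be pre-computed and cached, so the
    # decoder never re-encodes the context at query time.
    return np.concatenate([encoder(p) for p in passages], axis=0)

def cross_attend(decoder_states, context_states):
    # Single-head scaled dot-product cross-attention from decoder positions
    # onto the cached context encodings (projections omitted for brevity).
    d = decoder_states.shape[-1]
    scores = decoder_states @ context_states.T / np.sqrt(d)
    return softmax(scores) @ context_states

# Toy usage with a stand-in "encoder" (a hypothetical placeholder).
rng = np.random.default_rng(0)
toy_encoder = lambda passage: rng.normal(size=(len(passage.split()), 64))
ctx = encode_context(["first retrieved passage", "second retrieved passage"], toy_encoder)
dec = rng.normal(size=(5, 64))        # 5 decoder positions, model width 64
print(cross_attend(dec, ctx).shape)   # (5, 64)
```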
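The public datasets quoted under Open Datasets (English C4 and NQ-open) can be pulled from public mirrors for a reproduction attempt. The snippet below uses the Hugging Face datasets library as an assumed substitute for the paper's TFDS/T5X pipeline; the hosted C4 copy may differ from version 2.2.1, and the DPR-retrieved contexts from [18] are not included in nq_open and would have to be obtained separately.

```python
from datasets import load_dataset  # Hugging Face `datasets`; an assumed substitute
                                   # for the TFDS-based pipeline used in the paper.

# English C4 (the hosted copy may not match the paper's version 2.2.1).
c4_val = load_dataset("allenai/c4", "en", split="validation", streaming=True)

# NQ-open question/answer pairs; the retrieved contexts from [18] are separate.
nq_dev = load_dataset("nq_open", split="validation")

print(next(iter(c4_val))["text"][:80])
print(nq_dev[0]["question"])
```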
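Because the paper names its libraries but not their versions (Software Dependencies: No), a reproduction should at least record the versions it ends up using. A small helper for that is sketched below; the PyPI distribution names are assumptions, and T5X in particular is usually installed from its GitHub repository rather than from PyPI.

```python
from importlib import metadata

# Record the installed versions of libraries named in the paper. The PyPI
# package names below are assumptions and may differ from what the authors used.
for pkg in ("nltk", "sentencepiece", "scann"):
    try:
        print(f"{pkg}=={metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"{pkg}: not installed")
```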
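The pre-training schedule quoted under Experiment Setup (a fixed rate of 0.01 for 10,000 warmup steps, then square-root decay over the remaining 990,000 steps) matches the standard T5-style inverse-square-root rule, since 1/sqrt(10,000) = 0.01. The following is a minimal sketch under that assumption, not the authors' exact implementation.

```python
import math

def rsqrt_schedule(step: int, warmup_steps: int = 10_000) -> float:
    """Constant warmup followed by inverse-square-root decay.

    For step <= warmup_steps the rate is 1/sqrt(warmup_steps) = 0.01, matching
    the quoted fixed warmup rate; afterwards it decays as 1/sqrt(step).
    """
    return 1.0 / math.sqrt(max(step, warmup_steps))

# Sample values across the quoted 1,100,000-step run.
for step in (1, 10_000, 100_000, 1_100_000):
    print(f"step {step:>9,}: lr = {rsqrt_schedule(step):.6f}")
```

The fine-tuning stage quoted in the same row instead uses a fixed rate of 10^-4 with Adafactor, so no schedule function is needed there.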