Decoupled Context Processing for Context Augmented Language Modeling

Authors: Zonglin Li, Ruiqi Guo, Sanjiv Kumar

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We experimented with the same encoder-decoder context incorporation mechanism for both auto-regressive language modeling and open domain question answering." (A minimal sketch of this decoupled-context pattern follows the table.)
Researcher Affiliation | Industry | Zonglin Li, Google Research, New York (lizonglin@google.com); Ruiqi Guo, Google Research, New York (guorq@google.com); Sanjiv Kumar, Google Research, New York (sanjivk@google.com)
Pseudocode | No | The paper includes architectural diagrams (e.g., Figure 1) and describes its procedures in text, but it does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The ethics checklist states: "We still need to clean up the code before it's ready."
Open Datasets | Yes | "For auto-regressive language modeling, we use English C4 [32] version 2.2.1, the same as Retro." "We use the same question and context processed by [18], where the context is retrieved with DPR retriever [20]." (See the loading sketch below the table.)
Dataset Splits | Yes | (database / split: # articles, # entries) C4 / Train: 364.6M, 3382M; C4 / Val (unfiltered): 0.3645M, 3.369M; C4 / Val (filtered): 2.868M; NQ-open / Train: 79k; NQ-open / Dev: 8.8k; NQ-open / Test: 3.6k; Wiki / Database: 21M.
Hardware Specification | Yes | "For base and large we used 64 TPUv3 chips whereas 128 TPUv3 chips for training XL." "Each model is trained on 64 TPUv3 chips."
Software Dependencies | No | The paper mentions several software components and libraries, such as mT5 [42], NLTK [3], SentencePiece [22], the T5X retrieval framework [28], ScaNN [14], and the Adafactor optimizer [36], but does not provide version numbers for any of them. (A version-capture snippet follows the table.)
Experiment Setup | Yes | "We trained the Encoder-Decoder LM model for a total of 1,100,000 steps with a batch size of 512 and a default learning rate schedule of square-root decay. This corresponds to 10,000 warmup steps with a fixed learning rate of 0.01, followed by square-root decay for 990,000 steps." "We jointly fine-tuned the encoder and decoder for 40,000 steps with 20 context passages for each input of the train split. We used a batch size of 64, a fixed learning rate of 10^-4 and Adafactor optimizer [36]." (A schedule sketch follows the table.)
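The decoupled-context mechanism quoted under Research Type can be illustrated with a toy cross-attention example: retrieved passages are encoded once (and can be cached), and the decoder attends to the pre-computed encodings. The sketch below is a minimal NumPy illustration of that general pattern, not the authors' T5X implementation; the toy_encoder, the model width of 64, and the single-head attention without learned projections are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_context(passages, encoder):
    # Encode each retrieved passage independently and concatenate the results.
    # In a decoupled setup this step can be pre-computed and cached, so the
    # decoder never re-encodes the context at query time.
    return np.concatenate([encoder(p) for p in passages], axis=0)

def cross_attend(decoder_states, context_states):
    # Single-head scaled dot-product cross-attention from decoder positions
    # onto the cached context encodings (projections omitted for brevity).
    d = decoder_states.shape[-1]
    scores = decoder_states @ context_states.T / np.sqrt(d)
    return softmax(scores) @ context_states

# Toy usage with a stand-in "encoder" (a hypothetical placeholder).
rng = np.random.default_rng(0)
toy_encoder = lambda passage: rng.normal(size=(len(passage.split()), 64))
ctx = encode_context(["first retrieved passage", "second retrieved passage"], toy_encoder)
dec = rng.normal(size=(5, 64))        # 5 decoder positions, model width 64
print(cross_attend(dec, ctx).shape)   # (5, 64)
```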
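The public datasets quoted under Open Datasets (English C4 and NQ-open) can be pulled from public mirrors for a reproduction attempt. The snippet below uses the Hugging Face datasets library as an assumed substitute for the paper's TFDS/T5X pipeline; the hosted C4 copy may differ from version 2.2.1, and the DPR-retrieved contexts from [18] are not included in nq_open and would have to be obtained separately.

```python
from datasets import load_dataset  # Hugging Face `datasets`; an assumed substitute
                                   # for the TFDS-based pipeline used in the paper.

# English C4 (the hosted copy may not match the paper's version 2.2.1).
c4_val = load_dataset("allenai/c4", "en", split="validation", streaming=True)

# NQ-open question/answer pairs; the retrieved contexts from [18] are separate.
nq_dev = load_dataset("nq_open", split="validation")

print(next(iter(c4_val))["text"][:80])
print(nq_dev[0]["question"])
```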
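Because the paper names its libraries but not their versions (Software Dependencies: No), a reproduction should at least record the versions it ends up using. A small helper for that is sketched below; the PyPI distribution names are assumptions, and T5X in particular is usually installed from its GitHub repository rather than from PyPI.

```python
from importlib import metadata

# Record the installed versions of libraries named in the paper. The PyPI
# package names below are assumptions and may differ from what the authors used.
for pkg in ("nltk", "sentencepiece", "scann"):
    try:
        print(f"{pkg}=={metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"{pkg}: not installed")
```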
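The pre-training schedule quoted under Experiment Setup (a fixed rate of 0.01 for 10,000 warmup steps, then square-root decay over the remaining 990,000 steps) matches the standard T5-style inverse-square-root rule, since 1/sqrt(10,000) = 0.01. The following is a minimal sketch under that assumption, not the authors' exact implementation.

```python
import math

def rsqrt_schedule(step: int, warmup_steps: int = 10_000) -> float:
    """Constant warmup followed by inverse-square-root decay.

    For step <= warmup_steps the rate is 1/sqrt(warmup_steps) = 0.01, matching
    the quoted fixed warmup rate; afterwards it decays as 1/sqrt(step).
    """
    return 1.0 / math.sqrt(max(step, warmup_steps))

# Sample values across the quoted 1,100,000-step run.
for step in (1, 10_000, 100_000, 1_100_000):
    print(f"step {step:>9,}: lr = {rsqrt_schedule(step):.6f}")
```

The fine-tuning stage quoted in the same row instead uses a fixed rate of 10^-4 with Adafactor, so no schedule function is needed there.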