Lemur: Harmonizing Natural Language and Code for Language Agents

Authors: Yiheng Xu, Hongjin SU, Chen Xing, Boyu Mi, Qian Liu, Weijia Shi, Binyuan Hui, Fan Zhou, Yitao Liu, Tianbao Xie, Zhoujun Cheng, Siheng Zhao, Lingpeng Kong, Bailin Wang, Caiming Xiong, Tao Yu

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through meticulous pretraining using a code-intensive corpus and instruction fine-tuning on text and code data, our models achieve state-of-the-art averaged performance across diverse text and coding benchmarks. Comprehensive experiments demonstrate Lemur's superiority over existing open-source models and its proficiency across various agent tasks involving human communication, tool usage, and interaction under fully- and partially-observable environments.
Researcher Affiliation | Collaboration | XLang Lab, University of Hong Kong; Salesforce Research; Sea AI Lab; University of Washington; MIT CSAIL
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. Figure 1 shows an 'Overview of Training Procedure', which is a diagram, not pseudocode.
Open Source Code | Yes | Our model and code have been open-sourced at https://github.com/OpenLemur/Lemur.
Open Datasets | Yes | For the code part, we base it on The Stack (Kocetkov et al., 2022)... As for the text aspect, we use RefinedWeb (Penedo et al., 2023), RedPajama (Computer, 2023), as well as Common Crawl, Wikipedia, Books, ArXiv, Stack Exchange, and DM Mathematics (Saxton et al., 2019) to build the textual data corpus. (An illustrative data-mixture sketch follows the table.)
Dataset Splits | No | The paper describes the datasets used for pre-training and instruction fine-tuning, and mentions their use for 'training and evaluation', but it does not provide specific details on training, validation, and test dataset splits (e.g., percentages, sample counts, or explicit splitting methodology).
Hardware Specification | Yes | We train the Lemur-70B model initialized with Llama-2-70B using a TPUv4-512 pod.
Software Dependencies | No | The paper mentions software such as JAX, EasyLM (Geng, 2023), Huggingface Transformers (Wolf et al., 2019), and the Accelerate library, but it does not specify version numbers for these components. (A dependency version probe follows the table.)
Experiment Setup | Yes | We used a batch size of 4M tokens. Optimization was performed with Adam using a peak learning rate of 4e-5 along with β1 = 0.9 and β2 = 0.95. Gradients were clipped at 1.0. A cosine decay schedule was used for the learning rate, with a linear warmup of 2000 steps. For instruction fine-tuning, we train for 2 epochs, using the Adam optimizer with a learning rate of 2e-5 and a batch size of 128. (An optimizer-configuration sketch follows the table.)
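To make the corpus composition in the Open Datasets row concrete, the sketch below interleaves a code corpus with a text corpus using the Hugging Face datasets library. The repository IDs (bigcode/the-stack-dedup, tiiuae/falcon-refinedweb) and the 0.9/0.1 sampling probabilities are illustrative assumptions, not values taken from the paper.

```python
# Illustrative code/text mixture in streaming mode; IDs and weights are assumptions.
from itertools import islice
from datasets import load_dataset, interleave_datasets

# Accessing these Hub datasets may require accepting their terms of use.
code = load_dataset("bigcode/the-stack-dedup", split="train", streaming=True)
text = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)

# The paper describes a code-intensive corpus; 0.9/0.1 is only a placeholder
# for whatever code-to-text ratio was actually used.
mixed = interleave_datasets([code, text], probabilities=[0.9, 0.1], seed=42)

for example in islice(iter(mixed), 3):
    print(sorted(example.keys()))
```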
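Because the Software Dependencies row notes that no versions are given, a reproduction would have to pin them independently. The snippet below simply records whatever versions are installed locally for the packages named in the paper; EasyLM is typically installed from its GitHub repository rather than PyPI and is omitted from the probe.

```python
# Record locally installed versions of the libraries named in the paper.
# The paper itself does not pin versions.
import importlib.metadata as metadata

for pkg in ("jax", "transformers", "accelerate"):
    try:
        print(f"{pkg}=={metadata.version(pkg)}")
    except metadata.PackageNotFoundError:
        print(f"{pkg}: not installed")
```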
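As an illustration of the pretraining recipe quoted in the Experiment Setup row, the following sketch assembles an equivalent optimizer with optax (the JAX ecosystem the paper reports using via EasyLM). The total number of decay steps and the initial/final learning-rate values are hypothetical placeholders; the paper only states the peak learning rate, betas, clipping value, warmup length, and schedule shape.

```python
# Sketch of the stated pretraining optimizer: Adam (beta1=0.9, beta2=0.95),
# peak LR 4e-5, 2000-step linear warmup, cosine decay, grad clipping at 1.0.
import optax

total_steps = 25_000   # hypothetical; not stated in the paper
warmup_steps = 2_000   # from the paper
peak_lr = 4e-5         # from the paper

schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0,            # assumed warmup starting value
    peak_value=peak_lr,
    warmup_steps=warmup_steps,
    decay_steps=total_steps,
    end_value=0.0,             # assumed final value of the cosine decay
)

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),                        # gradient clipping
    optax.adam(learning_rate=schedule, b1=0.9, b2=0.95),   # Adam as described
)
```

The instruction fine-tuning stage quoted in the same row (learning rate 2e-5, batch size 128, 2 epochs) would use the same construction with those constants swapped in.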