L2MAC: Large Language Model Automatic Computer for Extensive Code Generation

Authors: Samuel Holt, Max Ruiz Luyten, Mihaela van der Schaar

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically demonstrate that L2MAC achieves state-of-the-art performance in generating large codebases for system design tasks, significantly outperforming other coding methods in implementing the detailed user-specified task; we show that L2MAC works for general-purpose extensive text-based tasks, such as writing an entire book; and we provide valuable insights into L2MAC's performance improvement over existing methods."
Researcher Affiliation | Academia | Samuel Holt, University of Cambridge, sih31@cam.ac.uk; Max Ruiz Luyten, University of Cambridge, mr971@cam.ac.uk; Mihaela van der Schaar, University of Cambridge, mv472@cam.ac.uk
Pseudocode | Yes | Appendix B, Control Unit Operation: "Here we expand on Section 3.3, provide pseudocode in Algorithm 1, provide an extended block-diagram figure of L2MAC in Figure 5, and a data flow diagram in Figure 6." and "Algorithm 1: Control Unit Pseudocode for L2MAC"
Open Source Code | Yes | "Full code at https://github.com/samholt/L2MAC." and "All code is available at https://github.com/samholt/L2MAC."
Open Datasets | Yes | "We use the standard HumanEval benchmark, as introduced by Chen et al. (2021)." and "The prompt questions for these tasks are derived from actual system design interview questions (Xu & Lam, 2020; Martin, 2023)."
Dataset Splits | No | The paper evaluates performance on benchmark tasks such as HumanEval, which uses held-out unit tests, and introduces its own system design tasks for code generation. While it details evaluation metrics, it does not specify explicit training/validation/test dataset splits for its experiments or for LLM fine-tuning, as the LLM itself (GPT-4) is a pre-trained model and the focus is on its generation capabilities given prompts.
Hardware Specification | No | The paper mentions using GPT-4 as the underlying LLM, but it does not provide specific details about the hardware (e.g., GPU models, CPU specifications) used to conduct its experiments or run the Code-L2MAC framework.
Software Dependencies | Yes | "For the LLM Processor, we use GPT-4-0613 OpenAI (2023)..." and, in the example requirements.txt and preferences: "flask==1.1.2 pytest==6.2.4" and "pytest dataclasses flask"
Experiment Setup | Yes | "We impose a maximum number of times an instruction can be re-tried, r_max, with a new context window when, during the execution of that instruction, it attempts to exceed the context window, forcing the context window to restart the instruction with a summary of the current progress made; we empirically set r_max = 30." and "We used the LLM GPT-4-0613 and, when using it throughout, set the temperature to 0.01." and "Additionally, when the CU detects that it is stuck in a loop repeating the same two messages over again (by comparing the most recent two messages in the context window), it increases the temperature of the LLM by 0.1 and continues until the temperature caps at 1.0; after it exits the loop, it reduces the temperature back to 0.01."
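The retry cap and loop-detection/temperature-escalation behaviour quoted above can be sketched as follows. This is a minimal illustration assuming a message-list context window; the function names, the exact repeat test, and the rounding are our own assumptions, not the authors' implementation (full code is at https://github.com/samholt/L2MAC).

```python
# Sketch of the Control Unit settings reported in the paper.
# detect_repeat_loop and escalate_temperature are hypothetical helper
# names; only the numeric constants come from the paper's setup.

BASE_TEMP = 0.01   # default sampling temperature (from the paper)
TEMP_STEP = 0.1    # escalation step on detecting a repeat loop
MAX_TEMP = 1.0     # temperature cap
R_MAX = 30         # maximum re-tries per instruction (from the paper)


def detect_repeat_loop(messages):
    """Flag a loop when the two most recent messages repeat the
    previous two (one possible reading of 'repeating the same two
    messages over again')."""
    return len(messages) >= 4 and messages[-2:] == messages[-4:-2]


def escalate_temperature(temp):
    """Raise the temperature by one step, capped at MAX_TEMP."""
    return min(round(temp + TEMP_STEP, 2), MAX_TEMP)
```

On exiting the loop, the temperature would be reset to BASE_TEMP, and an instruction exceeding the context window would be restarted (up to R_MAX times) with a summary of progress so far.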