MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation
Authors: Qian Huang, Jian Vora, Percy Liang, Jure Leskovec
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate. It can build compelling ML models over many tasks in MLAgentBench with 37.5% average success rate. |
| Researcher Affiliation | Academia | Qian Huang, Jian Vora, Percy Liang, Jure Leskovec (Stanford University). Correspondence to: Qian Huang <qhwang@cs.stanford.edu>. |
| Pseudocode | No | The paper does not contain a dedicated section, figure, or block explicitly labeled as 'Pseudocode' or 'Algorithm'. While it describes the agent's steps and actions, these are presented as textual descriptions and tables rather than formal pseudocode. |
| Open Source Code | Yes | Our code is released at https://github.com/snap-stanford/MLAgentBench/. |
| Open Datasets | Yes | MLAgentBench includes 13 ML tasks from diverse domains including text, image, time series, graphs, and tabular data as shown in Table 2. Our tasks include both well-studied datasets like CIFAR-10 and open challenges like Parkinson's disease progression prediction from Kaggle, which was released after the language model (e.g., GPT-4) pre-training and therefore has not been pretrained on. |
| Dataset Splits | No | The paper mentions 'validation accuracy' and 'training and testing data' but does not provide specific details on the dataset splits (e.g., percentages or sample counts for train, validation, and test sets). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory specifications, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | The starter code is based on diverse ML frameworks, including PyTorch (Paszke et al., 2019), TensorFlow (Abadi et al., 2015), JAX (Bradbury et al., 2018), Keras (Chollet et al., 2015), etc. However, the paper does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The script defines a simple CNN model with two convolution layers and three fully connected layers. It trains the model for 5 epochs on the CIFAR-10 dataset with a learning rate of 0.1, momentum of 0.9, and batch size of 128. Also in Appendix F: 'Edit Script (AI) Action Input: {"script_name": "train.py", "edit_instruction": "Change all instances of lr=0.1 to lr=0.3. Do not make any other changes.", "save_name": "train_lr03.py"}' (illustrative sketches of both appear below the table). |
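
To make the quoted experiment setup concrete, here is a minimal PyTorch sketch of a training script matching the stated configuration: two convolution layers, three fully connected layers, 5 epochs on CIFAR-10, learning rate 0.1, momentum 0.9, batch size 128. Everything beyond those stated facts (the `SimpleCNN` class name, channel widths, pooling, and normalization values) is an assumption for illustration; the authors' actual `train.py` is in the released repository and may differ.

```python
# Hedged sketch of the CIFAR-10 starter script described above; architecture
# details beyond "two conv layers, three fully connected layers" are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader


class SimpleCNN(nn.Module):
    """Two convolution layers followed by three fully connected layers."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


def main():
    transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    trainset = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transform)
    trainloader = DataLoader(trainset, batch_size=128, shuffle=True)

    model = SimpleCNN()
    criterion = nn.CrossEntropyLoss()
    # Hyperparameters quoted in the paper's setup description.
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    for epoch in range(5):  # train for 5 epochs
        for inputs, labels in trainloader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch + 1} done, last batch loss {loss.item():.3f}")


if __name__ == "__main__":
    main()
```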
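
The Appendix F example above shows the input to MLAgentBench's Edit Script (AI) action, which delegates the described edit to a language model. As a rough illustration of what that specific instruction amounts to, the stand-in below performs the literal substitution and saves the result under the requested name; the function name and direct string replacement are assumptions for illustration, not the benchmark's implementation.

```python
# Hypothetical stand-in for the Appendix F example: the real Edit Script (AI)
# action has an LLM rewrite the script, whereas here the instruction
# ("change all instances of lr=0.1 to lr=0.3") is applied literally.
from pathlib import Path


def apply_example_edit(script_name: str = "train.py",
                       save_name: str = "train_lr03.py") -> None:
    source = Path(script_name).read_text()
    edited = source.replace("lr=0.1", "lr=0.3")  # the requested change, nothing else
    Path(save_name).write_text(edited)


if __name__ == "__main__":
    apply_example_edit()
```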