MLAgentBench: Evaluating Language Agents on Machine Learning Experimentation
Authors: Qian Huang, Jian Vora, Percy Liang, Jure Leskovec
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark agents based on Claude v1.0, Claude v2.1, Claude v3 Opus, GPT-4, GPT-4-turbo, Gemini-Pro, and Mixtral and find that a Claude v3 Opus agent is the best in terms of success rate. It can build compelling ML models over many tasks in MLAgentBench with 37.5% average success rate. |
| Researcher Affiliation | Academia | Qian Huang, Jian Vora, Percy Liang, Jure Leskovec (Stanford University). Correspondence to: Qian Huang <qhwang@cs.stanford.edu>. |
| Pseudocode | No | The paper does not contain a dedicated section, figure, or block explicitly labeled as 'Pseudocode' or 'Algorithm'. While it describes the agent's steps and actions, these are presented as textual descriptions and tables rather than formal pseudocode. |
| Open Source Code | Yes | Our code is released at https://github.com/snap-stanford/MLAgentBench/. |
| Open Datasets | Yes | MLAgentBench includes 13 ML tasks from diverse domains including text, image, time series, graphs, and tabular data as shown in Table 2. Our tasks include both well-studied datasets like CIFAR-10 and open challenges like Parkinson's disease progression prediction from Kaggle, which was released after the language model (e.g., GPT-4) pre-training and therefore has not been pretrained on. |
| Dataset Splits | No | The paper mentions 'validation accuracy' and 'training and testing data' but does not provide specific details on the dataset splits (e.g., percentages or sample counts for train, validation, and test sets). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models, memory specifications, or cloud computing instance types used for running the experiments. |
| Software Dependencies | No | The starter code is based on diverse ML frameworks, including PyTorch (Paszke et al., 2019), TensorFlow (Abadi et al., 2015), JAX (Bradbury et al., 2018), Keras (Chollet et al., 2015), etc. However, the paper does not specify version numbers for these or any other software dependencies. |
| Experiment Setup | Yes | The script defines a simple CNN model with two convolution layers and three fully connected layers. It trains the model for 5 epochs on the CIFAR-10 dataset with a learning rate of 0.1, momentum of 0.9, and batch size of 128. Also in Appendix F: 'Edit Script (AI) Action Input: {"script_name": "train.py", "edit_instruction": "Change all instances of lr=0.1 to lr=0.3. Do not make any other changes.", "save_name": "train_lr03.py"}' (illustrative sketches of both appear below the table). |
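
To make the quoted experiment setup concrete, here is a minimal PyTorch sketch of a training script matching the stated configuration: two convolution layers, three fully connected layers, 5 epochs on CIFAR-10, learning rate 0.1, momentum 0.9, batch size 128. Everything beyond those stated facts (the `SimpleCNN` class name, channel widths, pooling, and normalization values) is an assumption for illustration; the authors' actual `train.py` is in the released repository and may differ.

```python
# Hedged sketch of the CIFAR-10 starter script described above; architecture
# details beyond "two conv layers, three fully connected layers" are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import DataLoader


class SimpleCNN(nn.Module):
    """Two convolution layers followed by three fully connected layers."""

    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


def main():
    transform = transforms.Compose(
        [transforms.ToTensor(),
         transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
    trainset = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=transform)
    trainloader = DataLoader(trainset, batch_size=128, shuffle=True)

    model = SimpleCNN()
    criterion = nn.CrossEntropyLoss()
    # Hyperparameters quoted in the paper's setup description.
    optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    for epoch in range(5):  # train for 5 epochs
        for inputs, labels in trainloader:
            optimizer.zero_grad()
            loss = criterion(model(inputs), labels)
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch + 1} done, last batch loss {loss.item():.3f}")


if __name__ == "__main__":
    main()
```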
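
The Appendix F example above shows the input to MLAgentBench's Edit Script (AI) action, which delegates the described edit to a language model. As a rough illustration of what that specific instruction amounts to, the stand-in below performs the literal substitution and saves the result under the requested name; the function name and direct string replacement are assumptions for illustration, not the benchmark's implementation.

```python
# Hypothetical stand-in for the Appendix F example: the real Edit Script (AI)
# action has an LLM rewrite the script, whereas here the instruction
# ("change all instances of lr=0.1 to lr=0.3") is applied literally.
from pathlib import Path


def apply_example_edit(script_name: str = "train.py",
                       save_name: str = "train_lr03.py") -> None:
    source = Path(script_name).read_text()
    edited = source.replace("lr=0.1", "lr=0.3")  # the requested change, nothing else
    Path(save_name).write_text(edited)


if __name__ == "__main__":
    apply_example_edit()
```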