FLAME: A Small Language Model for Spreadsheet Formulas
Authors: Harshit Joshi, Abishai Ebenezer, José Cambronero Sanchez, Sumit Gulwani, Aditya Kanade, Vu Le, Ivan Radiček, Gust Verbruggen
AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate FLAME on formula repair, formula completion, and similarity-based formula retrieval. We extensively evaluate FLAME and other larger language models on three formula assistance tasks: last-mile repair, formula completion, and formula retrieval. We analyze the impact of deduplication, tokenization, and training objectives on FLAME. |
| Researcher Affiliation | Collaboration | (1) Stanford University; (2) Microsoft, USA; (3) Microsoft Research, India; (4) University of Washington; (5) Demiurg |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | A technical appendix with operator details and evaluation dataset information is available at https://github.com/microsoft/prose-benchmarks/tree/main/FLAME |
| Open Datasets | Yes | We start with a dataset of 927M formulas drawn from a corpus of 1.8M public Excel workbooks collected from the web. Publicly available sources of spreadsheets that can be used as alternatives include Enron (Hermans and Murphy-Hill 2015), EUSES (Abraham and Erwig 2007), and FUSE (Barik et al. 2015). We sample 10K formulas from the public Enron spreadsheet corpus (Hermans and Murphy-Hill 2015) and mask constants (a hedged sketch of constant masking follows the table). We use the 273 labeled Excel formulas used in recent last-mile repair literature (Joshi et al. 2022), which were sourced from Excel help forums, and refer to this benchmark set as Forum. |
| Dataset Splits | No | The paper mentions the creation of fine-tuning datasets (e.g., 200K for repair, 189K for completion) and evaluation benchmarks (Forum with 273 formulas, Synthetic with 500 formulas), but it does not provide explicit training/validation/test dataset splits with percentages or counts for these fine-tuning datasets. While 'patience of 5 epochs' implies a validation set was used for early stopping, its size or composition is not detailed. |
| Hardware Specification | Yes | We pre-train FLAME for 10 epochs and fine-tune CodeT5 and FLAME on a cluster with 16 AMD MI200s, 96 cores, and 900 GB RAM. We carry out all Codex experiments on a cluster with 8 V100s, 40 cores, and 672 GB RAM. |
| Software Dependencies | No | The paper mentions using an 'Adafactor optimizer' and various models and techniques but does not specify software dependencies with version numbers (e.g., Python, PyTorch/TensorFlow, CUDA versions). |
| Experiment Setup | Yes | For Codex baselines, we use nucleus sampling (Holtzman et al. 2019) (temperature = 0.7) and sample 50 sequences per task. For CodeT5, we use beam search (width = 50), and we consider the top 50 sequences. We pre-train FLAME for 10 epochs and fine-tune CodeT5 and FLAME. We use an Adafactor optimizer with 1e-4 learning rate, clip factor of 1.0, and a linear learning rate schedule with 100 warm-up steps. We fine-tune FLAME for 2 epochs for repair and completion and fine-tune CodeT5 for 25 epochs with a patience of 5 epochs. We fine-tune FLAME and others for 10 epochs for the formula retrieval experiments. For fine-tuning, we use a weight decay of 0.1. For Codex fine-tuning, we use LoRA (Hu et al. 2021). A hedged sketch of these optimizer and decoding settings follows the table. |
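
The Open Datasets row notes that constants in the sampled Enron formulas are masked for the completion benchmark, but the paper does not describe the masking mechanism. The sketch below is one plausible interpretation, not the authors' code: it replaces string and numeric literals in an Excel formula with placeholder tokens, leaving cell references intact. The function name `mask_constants` and the `<STR>`/`<NUM>` placeholders are assumptions made for illustration.

```python
import re

# Hypothetical masking step: the paper says constants are masked but does not
# say how. This sketch swaps string and numeric literals for placeholders.
STRING_LITERAL = re.compile(r'"[^"]*"')                            # e.g. "High"
NUMBER_LITERAL = re.compile(r'(?<![A-Za-z0-9_$:])\d+(?:\.\d+)?')   # e.g. 100, 0.75

def mask_constants(formula: str) -> str:
    """Replace literal constants in an Excel formula with placeholder tokens."""
    masked = STRING_LITERAL.sub('<STR>', formula)
    masked = NUMBER_LITERAL.sub('<NUM>', masked)   # cell refs like B2 are untouched
    return masked

if __name__ == "__main__":
    print(mask_constants('=IF(B2>100, "High", B2*0.75)'))
    # -> =IF(B2><NUM>, <STR>, B2*<NUM>)
```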
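
The Experiment Setup row lists concrete optimizer and decoding settings. The following is a minimal sketch of those settings written against Hugging Face Transformers / PyTorch, not the authors' implementation: the checkpoint name, dataloader, total step count, and the nucleus top-p cutoff are assumptions; only the 1e-4 learning rate, clip factor 1.0, 100 warm-up steps, weight decay 0.1, temperature 0.7, and 50 candidate sequences come from the paper. The Codex baseline was queried via sampling from the API; a local seq2seq model stands in here purely to show the decoding parameters.

```python
from transformers import (
    Adafactor,
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    get_linear_schedule_with_warmup,
)

model_name = "Salesforce/codet5-base"  # stand-in checkpoint; FLAME itself is not released here
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Fine-tuning optimizer per the paper: Adafactor with a fixed 1e-4 learning
# rate, clip factor 1.0, and weight decay 0.1.
optimizer = Adafactor(
    model.parameters(),
    lr=1e-4,
    clip_threshold=1.0,     # "clip factor of 1.0"
    weight_decay=0.1,
    scale_parameter=False,  # required for a fixed (non-relative) learning rate
    relative_step=False,
    warmup_init=False,
)

# Linear schedule with 100 warm-up steps; the total step count is an assumption
# (it depends on dataset size and number of epochs).
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=100, num_training_steps=10_000
)

# Decoding settings used at evaluation time.
prompt = "=SUM(A1:A10"  # hypothetical broken formula for last-mile repair
inputs = tokenizer(prompt, return_tensors="pt")

# Codex baseline setting: nucleus sampling, temperature 0.7, 50 candidates.
# The paper does not state the top-p cutoff, so 0.95 is an assumption.
sampled = model.generate(
    **inputs, do_sample=True, temperature=0.7, top_p=0.95,
    num_return_sequences=50, max_new_tokens=64,
)

# CodeT5 setting: beam search with width 50, keeping the top 50 sequences.
beamed = model.generate(
    **inputs, num_beams=50, num_return_sequences=50, max_new_tokens=64,
)
```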