UQE: A Query Engine for Unstructured Databases
Authors: Hanjun Dai, Bethany Wang, Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Mangpo Phothilimthana, Charles Sutton, Dale Schuurmans
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We benchmark the accuracy and incurred cost of UQE on multimodal unstructured data analytics tasks, with the goal to show and understand when and why UQE can improve accuracy while keeping the cost low. |
| Researcher Affiliation | Collaboration | Hanjun Dai, Bethany Yixin Wang, Xingchen Wan, Bo Dai, Sherry Yang, Azade Nova, Pengcheng Yin, Phitchaya Mangpo Phothilimthana, Charles Sutton, Dale Schuurmans. Google DeepMind, Google Cloud, University of Alberta, Georgia Institute of Technology |
| Pseudocode | Yes | Algorithm 1 Stratified sampling for unbiased aggregation |
| Open Source Code | No | We are working on open sourcing the code after going through the internal approval process. |
| Open Datasets | Yes | The text based tasks include IMDB [27] movie reviews, customer service dialogs including Action-Based Conversations Dataset (ABCD [8]) and AirDialog [41], and image based Clevr [23] dataset. |
| Dataset Splits | No | The paper does not provide explicit training, validation, or test split percentages or sample counts for the datasets used. |
| Hardware Specification | Yes | The experiments were run on MacBook Pro CPU, so we expect this bottleneck would be alleviated with better engineered system, which we will focus in our future works. |
| Software Dependencies | No | We use voyage-2 [1] to embed the text-based unstructured columns, and Vertex [40] for multimodal embeddings. For the aggregation queries, we use faiss to cluster the embeddings into 10 groups... Then after every minibatch of samples collected, we train g via linear logistic regression and simply leverage sklearn for that. |
| Experiment Setup | Yes | Other parameters that might matter include: the sampling budget B for aggregation queries is 128 and for non-aggregation queries it is 256. For group-by queries the UQE needs a step in building the taxonomy, where the budget we use for that is 16. |
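
The Pseudocode and Experiment Setup rows mention stratified sampling for unbiased aggregation, 10 embedding clusters, and a sampling budget of B = 128. The sketch below illustrates the general idea of a stratified count estimator under those numbers. It is not a reproduction of the paper's Algorithm 1: the function name `stratified_count_estimate`, the proportional allocation rule, and the `predicate` callable (a hypothetical stand-in for an expensive LLM judgment) are all assumptions made for illustration, and the strata are given directly rather than derived from clustered embeddings.

```python
import random


def stratified_count_estimate(strata, predicate, budget, rng):
    """Estimate how many rows satisfy `predicate` via stratified sampling.

    strata:    list of lists, each holding the row ids of one stratum
               (assumed here; in practice these could come from clustering).
    predicate: callable row_id -> bool, a hypothetical stand-in for an
               expensive per-row judgment.
    budget:    maximum number of predicate calls allowed.
    Returns (estimated_count, number_of_predicate_calls).
    """
    nonempty = [s for s in strata if s]
    total = sum(len(s) for s in nonempty)
    k = len(nonempty)
    if budget < k:
        raise ValueError("budget must allow one sample per non-empty stratum")

    estimate = 0.0
    calls = 0
    for stratum in nonempty:
        # One guaranteed sample per stratum, plus a proportional share of
        # the remaining budget, capped at the stratum size.
        n = 1 + int((budget - k) * len(stratum) / total)
        n = min(n, len(stratum))
        sample = rng.sample(stratum, n)  # without replacement
        hits = sum(1 for row in sample if predicate(row))
        calls += n
        # The sample mean is unbiased for the stratum mean, so scaling by
        # the stratum size gives an unbiased estimate of the stratum count.
        estimate += len(stratum) * hits / n
    return estimate, calls


# Toy usage: 10 strata of 100 rows each; rows in even-numbered strata match.
strata = [list(range(i * 100, (i + 1) * 100)) for i in range(10)]
estimate, calls = stratified_count_estimate(
    strata,
    predicate=lambda row: (row // 100) % 2 == 0,
    budget=128,
    rng=random.Random(0),
)
```

In this toy setup every stratum is homogeneous, so the estimate equals the true count of 500 regardless of which rows are drawn. With mixed strata the estimator stays unbiased, and its variance shrinks as the strata become more internally uniform, which is the motivation for stratifying on clustered embeddings.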