GLUECons: A Generic Benchmark for Learning under Constraints

Authors: Hossein Rajaby Faghihi, Aliakbar Nafar, Chen Zheng, Roshanak Mirzaee, Yue Zhang, Andrzej Uszok, Alexander Wan, Tanawan Premsri, Dan Roth, Parisa Kordjamshidi

AAAI 2023

Reproducibility Variable | Result | LLM Response

Research Type: Experimental
  "In this work, we create a benchmark that is a collection of nine tasks in the domains of natural language processing and computer vision. In all cases, we model external knowledge as constraints, specify the sources of the constraints for each task, and implement various models that use these constraints. We report the results of these models using a new set of extended evaluation criteria in addition to the task performances for a more in-depth analysis."

Researcher Affiliation: Academia
  1. Michigan State University; 2. Florida Institute for Human and Machine Cognition; 3. University of California, Berkeley; 4. University of Pennsylvania

Pseudocode: No
  The paper describes various algorithms and methods but does not include any structured pseudocode blocks or clearly labeled algorithm figures.

Open Source Code: Yes
  "Details on experimental designs, training hyper-parameters, codes, models, and results can be found on our website: https://hlr.github.io/gluecons/"

Open Datasets: Yes
  "We utilize the classic MNIST (Deng 2012) dataset and classify images of handwritten digits... We use the CIFAR-100 (Krizhevsky, Sutskever, and Hinton 2012)... WIQA (Tandon et al. 2019) is a question-answering (QA) task... We use the SNLI (Bowman et al. 2015) dataset... We use the BeliefBank (Kassner et al. 2021) dataset... We focus on the CoNLL-2003 (Sang and De Meulder 2003) dataset... We use the CoNLL-2003 (Sang and De Meulder 2003) benchmark... We use the MNIST Arithmetic (Bloice, Roth, and Holzinger 2020) dataset."

Dataset Splits: Yes
  "We use the SNLI (Bowman et al. 2015) dataset, which includes 500k examples for training and 10k for evaluation... The dataset consists of 91 entities and 23k (2k train, 1k dev, 20k test) related facts extracted from ConceptNet (Speer, Chin, and Havasi 2016)."

Hardware Specification: Yes
  "Run times are recorded on a machine with Intel Core i9-9820X (10 cores, 3.30 GHz) CPU and Titan RTX with NVLink as GPU."

Software Dependencies: No
  "For a fair evaluation and to isolate the effect of the integration technique, we provide a repository of models and code for each task in both PyTorch (Paszke and Gross 2019) and DomiKnows (Faghihi et al. 2021) frameworks." The paper also mentions the use of Gurobi, but no version numbers are provided for any of these software components.

Experiment Setup: No
  The paper states, "Details on experimental designs, training hyper-parameters, codes, models, and results can be found on our website," indicating that specific hyperparameter values and training configurations are not given in the main text of the paper.
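The benchmark's core idea of modeling external knowledge as constraints can be illustrated with a small sketch. This is not the paper's implementation (which lives in PyTorch/DomiKnows on the project site); it is a hypothetical semantic-loss-style penalty for the MNIST Arithmetic task, where the constraint is that the predicted sum label must equal the sum of the two predicted digits. The function name and arguments are assumptions for illustration only.

```python
import math

def constraint_loss(p_a, p_b, p_s):
    """Soft-constraint penalty: -log P(constraint satisfied).

    p_a, p_b: predicted probability distributions over digits 0-9
    p_s: predicted probability distribution over sums 0-18
    The constraint is s = a + b; we sum the joint probability mass
    of all assignments that satisfy it, then penalize its negative log.
    """
    satisfied = 0.0
    for a, pa in enumerate(p_a):
        for b, pb in enumerate(p_b):
            s = a + b
            if s < len(p_s):
                satisfied += pa * pb * p_s[s]
    # Clamp to avoid log(0) when the constraint is fully violated.
    return -math.log(max(satisfied, 1e-12))
```

When the three distributions agree with the constraint (e.g., all mass on a=2, b=3, s=5), the penalty is zero; mass placed on inconsistent assignments drives it up, so it can be added to the task loss as a differentiable regularizer in a real model.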