A Pseudo-Semantic Loss for Autoregressive Models with Logical Constraints
Authors: Kareem Ahmed, Kai-Wei Chang, Guy Van den Broeck
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on Sudoku and shortest-path prediction cast as autoregressive generation, and observe that we greatly improve upon the base model's ability to predict logically-consistent outputs. We also evaluate on the task of detoxifying large language models. |
| Researcher Affiliation | Academia | Kareem Ahmed, Department of Computer Science, University of California, Los Angeles (ahmedk@cs.ucla.edu); Kai-Wei Chang, Department of Computer Science, University of California, Los Angeles (kwchang@cs.ucla.edu); Guy Van den Broeck, Department of Computer Science, University of California, Los Angeles (guyvdb@cs.ucla.edu) |
| Pseudocode | Yes | Algorithm 1 LSLpseudo(α; pθ) — 1: Input: logical constraint α and model pθ. 2: Output: pseudo-semantic loss of α w.r.t. θ. 3: // Obtain sample y from pθ 4: y ∼ pθ 5: // Get sequence length and num. of categories 6: seq, cats = y.shape() 7: // Expand the batch to contain all perturbations of y that are a Hamming distance of 1 away 8: y = y.expand(seq, cats) 9: y[:, range(seq), :, range(seq)] = range(cats) 10: // Evaluate expanded samples through model 11: log pθ = pθ(y).log_softmax(dim=-1) 12: // Compute the conditional probabilities: log pθ[i][j] = log pθ(y_j \| y_{−j}) 13: log pθ = log pθ − log pθ.logsumexp(dim=-1) 14: // Compute the probability of α under p̃_y by propagating the conditionals through the circuit cα 15: return log p̃_y(α) |
| Open Source Code | Yes | Our code is available at github.com/UCLA-StarAI/PseudoSL. |
| Open Datasets | Yes | We use the dataset provided by Wang et al. [43], consisting of 10K Sudoku puzzles, split into 9K training examples and 1K test samples, all puzzles having 10 missing entries. For this task, we follow the experimental setting set forth by [33], where our training set consists of 10,000 terrain maps curated using the Warcraft II tileset. Following previous work [15, 42], we evaluate on REALTOXICITYPROMPTS, a dataset of almost 100k prompts ranging from nontoxic, assigned a toxicity score of 0, to very toxic, assigned a toxicity score of 1. |
| Dataset Splits | Yes | A randomized 10k portion of the REALTOXICITYPROMPTS dataset was used to determine early stopping. |
| Hardware Specification | Yes | The experiments were run on a server with an AMD EPYC 7313P 16-Core Processor @ 3.7GHz, 2 NVIDIA RTX A6000, and 252 GB RAM. |
| Software Dependencies | No | The paper mentions software like PyTorch, Huggingface Accelerate, and the PySDD compiler, but does not provide specific version numbers for these components. For example, it states 'uses PyTorch [31]' without specifying the PyTorch version number. |
| Experiment Setup | Yes | We use a batch size of 16 and a learning rate of 1e-5 with the AdamW optimizer [23] with otherwise default parameters. We did a grid search over the pseudo-semantic loss weight in the values {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 2, 4, 8}. We used Adam with default PyTorch parameters and a learning rate of 3e-4. We did a grid search over the pseudo-semantic loss weight in the values {0.01, 0.05}. We used Adam with the default PyTorch parameters and a learning rate of 5e-4. We did a grid search over the pseudo-semantic loss weight in the values {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1}. |
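The expansion and normalization steps of Algorithm 1 above can be sketched in plain NumPy. This is an illustrative stand-in, not the authors' implementation: the real code operates on PyTorch tensors via `expand`/`log_softmax`, and the final step (propagating the conditionals through the compiled circuit cα) is paper-specific and omitted here. All function names below are hypothetical.

```python
import numpy as np

def hamming1_expand(y, cats):
    """All single-position perturbations of sequence y.

    Returns an array of shape (seq, cats, seq): entry [j, c] is y with
    position j replaced by category c (mirrors the expand/assign trick
    on lines 8-9 of Algorithm 1).
    """
    seq = len(y)
    expanded = np.tile(y, (seq, cats, 1))  # (seq, cats, seq) copies of y
    for j in range(seq):
        expanded[j, :, j] = np.arange(cats)  # vary only position j
    return expanded

def conditional_log_probs(logits):
    """Per-position conditionals log p(y_j = c | y_{-j}) from raw logits.

    `logits` has shape (seq, cats); normalizing over the category axis
    (a log-softmax, as on lines 11-13 of Algorithm 1) yields a proper
    conditional log-distribution at each position.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # stability
    return logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))

# Toy illustration: a stand-in "model" producing fixed random logits.
rng = np.random.default_rng(0)
seq, cats = 4, 3
y = rng.integers(0, cats, size=seq)          # a sampled sequence y ~ p
perturbations = hamming1_expand(y, cats)     # Hamming-distance-1 batch
log_cond = conditional_log_probs(rng.normal(size=(seq, cats)))

# Every perturbation differs from y in at most one position, and each
# row of log_cond is a normalized log-distribution over categories.
assert perturbations.shape == (seq, cats, seq)
assert np.allclose(np.exp(log_cond).sum(axis=-1), 1.0)
```

In the paper's setting, `log_cond` would then be propagated through the logical circuit cα to obtain log p̃_y(α); here the sketch stops at the conditional matrix.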