CASR: Generating Complex Sequences with Autoregressive Self-Boost Refinement

Authors: Hongwei Han, Mengyu Zhou, Shi Han, Xiu Li, Dongmei Zhang

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | By evaluating CASR on Sudoku, WebQSP, MTOP, and KVRET through controlled experiments and empirical studies, we find that CASR produces high-quality outputs. CASR also improves accuracy on Sudoku (70.93% → 97.28%) and achieves state-of-the-art performance on KVRET in Micro F1 score (67.88% → 70.00%).
Researcher Affiliation | Collaboration | Hongwei Han (1), Mengyu Zhou (2), Shi Han (2), Xiu Li (1), Dongmei Zhang (2); (1) Tsinghua Shenzhen International Graduate School, Tsinghua University; (2) Microsoft Research.
Pseudocode | Yes | Algorithm 1: CASR Inference Process; Algorithm 2: CASR Training Process (Section 3.1). A hedged sketch of the iterative refinement loop these algorithms describe is given after the table.
Open Source Code | Yes | The code of the CASR framework is open-sourced at https://github.com/RalphHan/CASR.
Open Datasets | Yes | WebQSP (Yih et al., 2016) is a classic dataset for KBQA (Knowledge Base Question Answering). MTOP (Li et al., 2021) is a benchmark for comprehensive multilingual task-oriented semantic parsing. KVRET (Eric et al., 2017) is a benchmark for table conversation. Sudoku (Park) is an open dataset on Kaggle.
Dataset Splits | Yes | Table 2 reports the number of samples in the train, dev, and test splits: WebQSP 2,673 / 309 / 1,639; MTOP 15,667 / 2,235 / 4,386; KVRET 6,291 / 777 / 808; Sudoku 800K / 100K / 100K.
Hardware Specification | Yes | We train on 4 Tesla V100 GPUs.
Software Dependencies | No | The paper mentions T5-base and T5-large backbones and DeepSpeed, but does not provide specific version numbers for these or for other software such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For the three tasks (WebQSP, MTOP, KVRET), we set the batch size to 128, the learning rate to 2e-5, the max input length to 1024, the max generation length to 128, and the beam size to 4, and we evaluate every 2K steps for checkpoint selection. For Sudoku, we train a 12-layer encoder-decoder Transformer from scratch with d-model = 512, ffn-dim = 2048, and num-heads = 8. We set the max castep T = 5 and the max epoch E = 10K steps. We set the batch size to 1024, the learning rate to 2e-5, and the beam size to 2, and we evaluate every 2K steps for checkpoint selection. A hedged configuration sketch based on these values follows the table.
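The paper's Algorithm 1 (CASR Inference Process) is not reproduced in this report. As a rough illustration of the autoregressive self-boost refinement idea quoted above, the following is a minimal sketch of an iterative inference loop, assuming a T5-style seq2seq model driven through Hugging Face transformers. The conditioning format that appends the previous draft, the [PREV] separator, and the convergence check are all assumptions for illustration, not the authors' code.

```python
# Hedged sketch of a CASR-style iterative refinement inference loop.
# Assumptions (not from the paper's released code): a T5-style seq2seq model,
# the previous round's output is appended to the source for the next round,
# and iteration stops after T rounds or once the output stops changing.
from transformers import T5ForConditionalGeneration, T5Tokenizer

def casr_inference(model, tokenizer, source_text, max_casteps=5,
                   beam_size=4, max_input_len=1024, max_gen_len=128):
    prev_output = ""  # no draft exists before the first round
    for _ in range(max_casteps):
        # Condition the next round on the source plus the previous draft
        # (the exact concatenation format is an assumption).
        model_input = (source_text if not prev_output
                       else f"{source_text} [PREV] {prev_output}")
        inputs = tokenizer(model_input, truncation=True,
                           max_length=max_input_len,
                           return_tensors="pt").to(model.device)
        output_ids = model.generate(**inputs, num_beams=beam_size,
                                    max_length=max_gen_len)
        output = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        if output == prev_output:  # refinement no longer changes the draft
            break
        prev_output = output
    return prev_output
```

In the reported setup the number of refinement rounds is capped by the max castep T (5 for the quoted configuration); the actual conditioning format and stopping rule are whatever Algorithm 1 in the paper specifies.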
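Similarly, the hyperparameters listed in the Experiment Setup row can be collected into a configuration. The sketch below uses the Hugging Face T5Config only to make the stated Sudoku model dimensions concrete; the 6 + 6 encoder/decoder layer split, the library choice, and the dictionary of training hyperparameters are assumptions based on the quoted values, not the authors' released configuration.

```python
# Hedged sketch of the from-scratch Sudoku model described above.
# Whether "12-layer encoder-decoder" means 12 layers per stack or 6 + 6 is not
# stated in this summary, so the 6 + 6 split below is an assumption.
from transformers import T5Config, T5ForConditionalGeneration

sudoku_config = T5Config(
    d_model=512,           # d-model = 512
    d_ff=2048,             # ffn-dim = 2048
    num_heads=8,           # num-heads = 8
    num_layers=6,          # encoder layers (assumed split of the 12 layers)
    num_decoder_layers=6,  # decoder layers (assumed split of the 12 layers)
)
sudoku_model = T5ForConditionalGeneration(sudoku_config)  # random init, trained from scratch

# Training hyperparameters as quoted above (values only; the training loop,
# DeepSpeed integration, and checkpoint-selection logic are not reproduced here).
train_hparams = {
    "batch_size": 1024,        # 128 for WebQSP / MTOP / KVRET
    "learning_rate": 2e-5,
    "beam_size": 2,            # 4 for WebQSP / MTOP / KVRET
    "max_castep_T": 5,
    "eval_every_steps": 2000,  # checkpoint selection on the dev split
}
```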