TextGrad: Advancing Robustness Evaluation in NLP by Gradient-Driven Optimization
Authors: Bairu Hou, Jinghan Jia, Yihua Zhang, Guanhua Zhang, Yang Zhang, Sijia Liu, Shiyu Chang
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments are provided to demonstrate the effectiveness of TEXTGRAD not only in attack generation for robustness evaluation but also in adversarial defense. |
| Researcher Affiliation | Collaboration | UC Santa Barbara, Michigan State University, MIT-IBM Watson AI Lab |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found. |
| Open Source Code | Yes | Codes are available at https://github.com/UCSB-NLP-Chang/TextGrad |
| Open Datasets | Yes | SST-2 (Socher et al., 2013) for sentiment analysis, MNLI (Williams et al., 2018), RTE (Wang et al., 2018), and QNLI (Wang et al., 2018) for natural language inference and AG News (Zhang et al., 2015) for text classification. |
| Dataset Splits | Yes | For datasets where the test-set labels are not available (MNLI, RTE, QNLI), we randomly sample 10% of the training set as a validation set and use the original validation set for testing. For AG News, where no validation set is available, we generate the validation set in the same way. (See the split sketch below the table.) |
| Hardware Specification | Yes | We run our experiments on the Tesla V100 GPU with 16GB memory. |
| Software Dependencies | No | No specific version numbers for general software dependencies like Python, PyTorch, or CUDA were explicitly mentioned. |
| Experiment Setup | Yes | We fine-tune the pre-trained BERT-base-uncased model on each dataset with a batch size of 32 and a learning rate of 2e-5 for 5 epochs. For RoBERTa-large and ALBERT-xxlarge-v2, we use a batch size of 16 and a learning rate of 1e-5. ... For the hyper-parameters of TEXTGRAD, we use 20-step PGD for optimization and fix the number of samples R in each iteration to 20. We adopt a learning rate of 0.8 for both z and u, and normalize the gradients g_{1,t} and g_{2,t} to unit norm before the descent step. (See the PGD update sketch below the table.) |
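The validation-split strategy reported in the Dataset Splits row can be illustrated with a short sketch. This is a minimal illustration assuming the Hugging Face `datasets` library and GLUE task names; the actual split code in the TextGrad repository may differ.

```python
# Minimal sketch of the described 10% validation split, assuming the Hugging
# Face `datasets` library and GLUE task names; the actual procedure in the
# TextGrad repository (https://github.com/UCSB-NLP-Chang/TextGrad) may differ.
from datasets import load_dataset

def build_splits(task_name: str, seed: int = 42):
    """For tasks without public test labels (MNLI, RTE, QNLI), hold out 10%
    of the training data as validation and reuse the original validation
    split as the test set."""
    raw = load_dataset("glue", task_name)
    held_out = raw["train"].train_test_split(test_size=0.1, seed=seed)
    train_set = held_out["train"]   # remaining 90% of the training data
    val_set = held_out["test"]      # 10% of training data, used as validation
    test_key = "validation_matched" if task_name == "mnli" else "validation"
    test_set = raw[test_key]        # original validation split becomes the test set
    return train_set, val_set, test_set
```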
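The TEXTGRAD hyper-parameters in the Experiment Setup row (20 PGD steps, R = 20 samples per step, learning rate 0.8, unit-norm gradients) suggest the following hedged sketch of the update loop. The callback `estimate_gradients`, the variable names `z` and `u`, and the projection steps are illustrative placeholders rather than the paper's exact formulation; the released code contains the real implementation.

```python
# Hedged sketch of the reported optimization settings: 20 PGD iterations,
# R = 20 sampled perturbations per iteration, learning rate 0.8 for both z
# (site selection) and u (substitution) variables, and gradients g_{1,t},
# g_{2,t} normalized to unit norm before the descent step. The callback
# `estimate_gradients` and the projections are placeholders, not the paper's
# exact operators.
import torch

def pgd_step(z, u, g1, g2, lr=0.8):
    """One projected-gradient step on the relaxed attack variables z and u."""
    g1 = g1 / (g1.norm() + 1e-12)   # normalize g_{1,t} to unit norm
    g2 = g2 / (g2.norm() + 1e-12)   # normalize g_{2,t} to unit norm
    z = z - lr * g1                 # descent step on the attack loss
    u = u - lr * g2
    # Crude placeholder projections back onto feasible sets; the paper's exact
    # projections (e.g., onto a sparsity-constrained simplex) are not shown here.
    z = z.clamp(0.0, 1.0)
    u = u.clamp(min=0.0)
    u = u / u.sum(dim=-1, keepdim=True).clamp(min=1e-12)
    return z, u

def textgrad_attack_sketch(z, u, estimate_gradients, steps=20, num_samples=20, lr=0.8):
    """Run the 20-step loop; `estimate_gradients` is a hypothetical callback
    that draws `num_samples` discrete samples and returns (g1, g2)."""
    for _ in range(steps):
        g1, g2 = estimate_gradients(z, u, num_samples=num_samples)
        z, u = pgd_step(z, u, g1, g2, lr=lr)
    return z, u
```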