Can LLM-Generated Misinformation Be Detected?
Authors: Canyu Chen, Kai Shu
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Then, through extensive empirical investigation, we discover that LLM-generated misinformation can be harder to detect for humans and detectors compared to human-written misinformation with the same semantics, which suggests it can have more deceptive styles and potentially cause more harm. |
| Researcher Affiliation | Academia | Canyu Chen (Illinois Institute of Technology, cchen151@hawk.iit.edu); Kai Shu (Illinois Institute of Technology, kshu@iit.edu) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | Project website: https://llm-misinformation.github.io/. The website states 'Our code will be made publicly available soon.' and links only to the dataset repository, not the methodology code. The paper likewise states: 'The dataset has been open-sourced in the GitHub repository https://github.com/llm-misinformation/llm-misinformation.', which refers to the data, not the source code for the methodology itself. |
| Open Datasets | Yes | We adopt three typical real-world human-written misinformation datasets including Politifact (Shu et al., 2020), Gossipcop (Shu et al., 2020) and CoAID (Cui & Lee, 2020). ... The dataset has been open-sourced in the GitHub repository https://github.com/llm-misinformation/llm-misinformation. |
| Dataset Splits | No | The paper uses LLMs as detectors with a zero-shot prompting strategy, so no model is trained or validated on splits of these datasets in the traditional sense. The data is used only for evaluation: "we utilize the whole Politifact dataset and the randomly sampled 10% data of the Gossipcop and CoAID datasets with the random seed as 1." This describes the evaluation data, not training/validation splits. (A hedged reconstruction of this sampling appears after the table.) |
| Hardware Specification | No | The paper mentions using specific LLM models (e.g., ChatGPT-3.5, GPT-4, Llama2, Vicuna) and their APIs but does not specify the underlying hardware (e.g., GPU models, CPU types) on which these models or experiments were run. |
| Software Dependencies | Yes | As for ChatGPT-3.5 (gpt-3.5-turbo) or GPT-4 (gpt-4) as generators or detectors, we adopt the default API setting of OpenAI. As for Llama2 (Llama2-7B-chat, Llama2-13B-chat, and Llama2-70B-chat) and Vicuna (Vicuna-7b-v1.3, Vicuna-13b-v1.3, and Vicuna-33b-v1.3) as generators or detectors, we adopt the hyperparameters for the sampling strategy as follows: top_p = 0.9, temperature = 0.8, max_tokens = 2,000. (Hedged reconstructions of both API setups appear after the table.) |
| Experiment Setup | Yes | As for Llama2 (Llama2-7B-chat, Llama2-13B-chat, and Llama2-70B-chat) and Vicuna (Vicuna-7b-v1.3, Vicuna-13b-v1.3, and Vicuna-33b-v1.3) as generators or detectors, we adopt the hyperparameters for the sampling strategy as follows: top_p = 0.9, temperature = 0.8, max_tokens = 2,000. |
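The evaluation-data selection quoted under Dataset Splits ("the whole Politifact dataset and the randomly sampled 10% data of the Gossipcop and CoAID datasets with the random seed as 1") is specific enough to reconstruct. The sketch below is a plausible Python reading of it, not the authors' code: the CSV filenames, the column layout, and the use of pandas are assumptions; only the 10% fraction and seed 1 come from the paper.

```python
# Hedged sketch of the evaluation-data selection described in the paper.
# File names are hypothetical; only frac=0.1 and random_state=1 are sourced.
import pandas as pd

gossipcop = pd.read_csv("gossipcop.csv")  # hypothetical filename
coaid = pd.read_csv("coaid.csv")          # hypothetical filename

# 10% random sample with random seed 1, as stated; Politifact is used whole.
gossipcop_eval = gossipcop.sample(frac=0.1, random_state=1)
coaid_eval = coaid.sample(frac=0.1, random_state=1)
```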
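For ChatGPT-3.5 and GPT-4, the paper reports only "the default API setting of OpenAI", i.e., no custom decoding hyperparameters. A minimal sketch using the current openai Python client; the client version (the paper predates the v1 interface) and the prompt text are assumptions:

```python
# Hedged sketch: calls gpt-3.5-turbo with OpenAI's defaults, passing no
# temperature/top_p overrides, matching "the default API setting of OpenAI".
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # or "gpt-4", as named in the paper
    messages=[{"role": "user", "content": "<generation or detection prompt>"}],
)
print(response.choices[0].message.content)
```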
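The Llama2/Vicuna decoding settings (top_p = 0.9, temperature = 0.8, max_tokens = 2,000) map onto standard nucleus-sampling parameters. A minimal sketch assuming the Hugging Face transformers API, which the paper does not name; the checkpoint, the prompt, and the mapping of max_tokens to max_new_tokens are assumptions:

```python
# Hedged reconstruction: the paper reports only the sampling hyperparameters,
# not the inference stack. Hugging Face transformers is an assumption here.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "lmsys/vicuna-7b-v1.3"  # one of the checkpoints named in the paper

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

prompt = "<generation or detection prompt>"  # not specified in this section
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Sampling strategy exactly as reported: top_p = 0.9, temperature = 0.8,
# max_tokens = 2,000 (mapped to max_new_tokens, an assumption about intent).
outputs = model.generate(
    **inputs,
    do_sample=True,  # required for top_p / temperature to take effect
    top_p=0.9,
    temperature=0.8,
    max_new_tokens=2000,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```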