Deprecated: The each() function is deprecated. This message will be suppressed on further calls in /home/zhenxiangba/zhenxiangba.com/public_html/phproxy-improved-master/index.php on line 456
Paper page - XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models
[go: Go Back, main page]

Papers
arxiv:2308.01263

XSTest: A Test Suite for Identifying Exaggerated Safety Behaviours in Large Language Models

Published on Aug 2, 2023
Authors:
,
,
,

Abstract

A new test suite called XSTest is introduced to identify exaggerated safety behaviors in large language models, highlighting their systematic failures to comply with safe prompts.

Without proper safeguards, large language models will readily follow malicious instructions and generate toxic content. This motivates safety efforts such as red-teaming and large-scale feedback learning, which aim to make models both helpful and harmless. However, there is a tension between these two objectives, since harmlessness requires models to refuse complying with unsafe prompts, and thus not be helpful. Recent anecdotal evidence suggests that some models may have struck a poor balance, so that even clearly safe prompts are refused if they use similar language to unsafe prompts or mention sensitive topics. In this paper, we introduce a new test suite called XSTest to identify such eXaggerated Safety behaviours in a structured and systematic way. In its current form, XSTest comprises 200 safe prompts across ten prompt types that well-calibrated models should not refuse to comply with. We describe XSTest's creation and composition, and use the test suite to highlight systematic failure modes in a recently-released state-of-the-art language model.

Community

Sign up or log in to comment

Get this paper in your agent:

hf papers read 2308.01263
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash

Models citing this paper 20

Browse 20 models citing this paper

Datasets citing this paper 10

Browse 10 datasets citing this paper

Spaces citing this paper 66

Browse 66 spaces citing this paper

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.