SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

1Shanghai Artificial Intelligence Laboratory 2Harbin Institute of Technology 3Beijing Institute of Technology 4Chinese University of Hong Kong
*Equal contribution Corresponding author

Overview

Illustration of question enhancement and evaluation procedures in SALAD-Bench.Base questions are augmented to generate multiple subsets, including multiple-choice questions, attack-enhanced, and defense-enhanced variants. These subsets are instrumental for a comprehensive, multi-level evaluation of LLM safety metrics. The attack-enhanced subset is particularly useful for appraising defense methods, while the defense-enhanced subset is applied to assess attack strategies. Highlighted by the purple circles, the figure contrasts the nuanced safety evaluations of LLMs across different domains, tasks, and categories, underscoring distinct safety performance disparities.

ViewNeTI pull figure and sample novel view synthesis results.

Comparison with other benchmarks

Comparison between various safety evaluation benchmarks and SALAD-Bench, where Q represents raw questions in question-answering tasks, MCQ means multiple-choice questions, MD means providing multi-dimensional evaluation results for all taxonomies and H indicates manually constructed data from human.

ViewNeTI pull figure and sample novel view synthesis results.

Video

Abstract

In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this crucial need, we propose \emph{SALAD-Bench}, a safety benchmark specifically designed for evaluating LLMs, attack, and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.SALAD-Bench is crafted with a meticulous array of questions, from standard queries to complex ones enriched with attack, defense modifications and multiple-choice.

To effectively manage the inherent complexity, we introduce an innovative evaluators: the LLM-based MD-Judge for QA pairs with a particular focus on attack-enhanced queries, ensuring a seamless, and reliable evaluation. Above components extend SALAD-Bench from standard LLM safety evaluation to both LLM attack and defense methods evaluation, ensuring the joint-purpose utility. Our extensive experiments shed light on the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics.

Safety categories

Three levels and 66 categories focused on safety issues.

ViewNeTI pull figure and sample novel view synthesis results.

Data distribution

Data source of base set in SALAD-Bench.

ViewNeTI pull figure and sample novel view synthesis results.

Question processing

To tackle content duplication, we use the Locality-Sensitive Hashing algorithm combined with Sentence-BERT for sentence vector embeddings. To address the issue of benign samples and minimize manual review costs, we utilized the reward model, pre-trained on SafeRLHF, to assess the safety of each data sample.

To categorize questions from public datasets into SALAD-Bench's category-level taxonomies, we employ LLMs for automated labeling through in-context learning and consensus voting. We select Mixtral-8x7B-Instruct, Mistral-7B-Instruct, and TuluV2-dpo-70B for the task.

ViewNeTI pull figure and sample novel view synthesis results.

Question enhancement

Construction of the attack-enhanced dataset. (a)Generate response on all candidate models. (b)Filter questions with high rejection rates. (c)Enhance remaining questions with attack methods. (d) Generate on all models, evaluate, and keep enhanced questions with lowest safety rates.

ViewNeTI pull figure and sample novel view synthesis results.

Construction pipeline of the defense-enhanced dataset. (a) Generate response on all candidate models. (b) Keep questions with high rejection rates. (c) Attack each question and keep failed ones. (d) Enhance remaining questions with defense methods.

ViewNeTI pull figure and sample novel view synthesis results.

For each question, we also generate multiple safe responses and unsafe responses as candidates. The safe and unsafe responses are generated using GPT-4 and fine-tuned GPT respectively.

MD-Judge for QA pairs

Previous LLM safety benchmarks usually utilize 1) keyword matching, 2) moderation classifiers and 3) GPT to conduct safety evaluation. However, keyword matching-based evaluator struggles with various outputs from LLMs; moderation classifiers only focus on a narrow spectrum of safety threats; GPT incurs much extra cost to call APIs. We use public and self-generated data to fine-tune an LLM-based classifier from Mistral-7B.During fine-tuning, we propose a safety evaluation template to reformat question-answer pairs for MD-Judge predictions, as shown in the Figure.

ViewNeTI pull figure and sample novel view synthesis results.

MCQ-Judge

We also introduce MCQ-Judge, which leverages in-context learning with regex parsing to efficiently fetch the answers.

ViewNeTI pull figure and sample novel view synthesis results.

Results

Safety Rate

Overall safety rate results can be seen in the leaderboard. Safety rate on specific domain can be seen in the huggingface.

Attack Comparison

Attack success rate comparison of attack methods

ViewNeTI pull figure and sample novel view synthesis results.

Defense Comparison

Attack success rate comparison of defense methods on attack-enhanced subset

ViewNeTI pull figure and sample novel view synthesis results.

BibTeX

@misc{li2024saladbench,
      title={SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models}, 
      author={Lijun Li and Bowen Dong and Ruohui Wang and Xuhao Hu and Wangmeng Zuo and Dahua Lin and Yu Qiao and Jing Shao},
      year={2024},
      eprint={2402.05044},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}