SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

Illustration of question enhancement and evaluation procedures in SALAD-Bench.Base questions are augmented to generate multiple subsets, including multiple-choice questions, attack-enhanced, and defense-enhanced variants. These subsets are instrumental for a comprehensive, multi-level evaluation of LLM safety metrics. The attack-enhanced subset is particularly useful for appraising defense methods, while the defense-enhanced subset is applied to assess attack strategies. Highlighted by the purple circles, the figure contrasts the nuanced safety evaluations of LLMs across different domains, tasks, and categories, underscoring distinct safety performance disparities.

Comparison between various safety evaluation benchmarks and SALAD-Bench, where Q represents raw questions in question-answering tasks, MCQ means multiple-choice questions, MD means providing multi-dimensional evaluation results for all taxonomies and H indicates manually constructed data from human.

In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this crucial need, we propose \emph{SALAD-Bench}, a safety benchmark specifically designed for evaluating LLMs, attack, and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.SALAD-Bench is crafted with a meticulous array of questions, from standard queries to complex ones enriched with attack, defense modifications and multiple-choice.

To effectively manage the inherent complexity, we introduce an innovative evaluators: the LLM-based MD-Judge for QA pairs with a particular focus on attack-enhanced queries, ensuring a seamless, and reliable evaluation. Above components extend SALAD-Bench from standard LLM safety evaluation to both LLM attack and defense methods evaluation, ensuring the joint-purpose utility. Our extensive experiments shed light on the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics.

Safety categories

Three levels and 65 categories focused on safety issues.

ViewNeTI pull figure and sample novel view synthesis results.

Data source of base set in SALAD-Bench.

To tackle content duplication, we use the Locality-Sensitive Hashing algorithm combined with Sentence-BERT for sentence vector embeddings. To address the issue of benign samples and minimize manual review costs, we utilized the reward model, pre-trained on SafeRLHF, to assess the safety of each data sample.

To categorize questions from public datasets into SALAD-Bench's category-level taxonomies, we employ LLMs for automated labeling through in-context learning and consensus voting. We select Mixtral-8x7B-Instruct, Mistral-7B-Instruct, and TuluV2-dpo-70B for the task.

Construction of the attack-enhanced dataset. (a)Generate response on all candidate models. (b)Filter questions with high rejection rates. (c)Enhance remaining questions with attack methods. (d) Generate on all models, evaluate, and keep enhanced questions with lowest safety rates.

Construction pipeline of the defense-enhanced dataset. (a) Generate response on all candidate models. (b) Keep questions with high rejection rates. (c) Attack each question and keep failed ones. (d) Enhance remaining questions with defense methods.

For each question, we also generate multiple safe responses and unsafe responses as candidates. The safe and unsafe responses are generated using GPT-4 and fine-tuned GPT respectively.

Previous LLM safety benchmarks usually utilize 1) keyword matching, 2) moderation classifiers and 3) GPT to conduct safety evaluation. However, keyword matching-based evaluator struggles with various outputs from LLMs; moderation classifiers only focus on a narrow spectrum of safety threats; GPT incurs much extra cost to call APIs. We use public and self-generated data to fine-tune an LLM-based classifier from Mistral-7B.During fine-tuning, we propose a safety evaluation template to reformat question-answer pairs for MD-Judge predictions, as shown in the Figure.

We also introduce MCQ-Judge, which leverages in-context learning with regex parsing to efficiently fetch the answers.

Safety Rate

Overall safety rate results can be seen in the leaderboard. Safety rate on specific domain can be seen in the huggingface.

Attack Comparison

Attack success rate comparison of attack methods

Defense Comparison

Attack success rate comparison of defense methods on attack-enhanced subset

BibTeX

@misc{li2024saladbench,
      title={SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models}, 
      author={Lijun Li and Bowen Dong and Ruohui Wang and Xuhao Hu and Wangmeng Zuo and Dahua Lin and Yu Qiao and Jing Shao},
      year={2024},
      eprint={2402.05044},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

Overview

Comparison with other benchmarks

Video

Abstract

Safety categories

Data distribution

Question processing

Question enhancement

MD-Judge for QA pairs

MCQ-Judge

Results

Safety Rate

Attack Comparison

Defense Comparison

BibTeX