Quantitative Certification of Bias in Large Language Models

University of Illinois Urbana-Champaign, Amazon, Pyron, VMware Research

Abstract

Content Warning: This work contains examples of offensive language.

Large Language Models (LLMs) can produce responses that exhibit social biases and support stereotypes. However, conventional benchmarking is insufficient to thoroughly evaluate LLM bias, as it cannot scale to large sets of prompts and provides no guarantees. Therefore, we propose a novel certification framework, QuaCer-B, that provides formal guarantees on obtaining unbiased responses from target LLMs under large sets of prompts. A certificate consists of high-confidence bounds on the probability of obtaining biased responses from the LLM for any set of prompts containing sensitive attributes, sampled from a distribution. We illustrate bias certification in LLMs for prompts with various prefixes drawn from given distributions. We consider distributions of random token sequences, mixtures of manual jailbreaks, and jailbreaks in the LLM's embedding space to certify its bias. We certify popular LLMs with QuaCer-B and present novel insights into their biases.

Overview

Large Language Models (LLMs) have shown impressive performance as chatbots and are hence used by millions of people worldwide. This, however, brings their safety and trustworthiness to the forefront, making it imperative to guarantee their reliability. Prior work has generally focused on establishing trust in LLMs using evaluations on standard benchmarks. This analysis, however, is insufficient due to the limitations of the benchmarking datasets, their use in LLMs' safety training, and the lack of guarantees through benchmarking. As an alternative, we propose quantitative certificates for LLMs and develop a novel framework, QuaCer-B, to quantitatively certify LLMs for bias in their responses. We define bias as an asymmetry in the LLM's responses to a set of prompts that differ only by a sensitive attribute.

QuaCer-B considers a given distribution over sets of prompts to certify a target LLM. The certificate consists of high-confidence bounds on the probability of obtaining a biased response from the LLM for a set of prompts randomly sampled from the distribution. These bounds are compared against a user-defined bias-tolerance threshold: the LLM is certified as biased if the lower bound exceeds the threshold and as unbiased if the upper bound falls below it. The figure below presents an overview of QuaCer-B on an example distribution of prompts developed from a sample from the BOLD dataset.
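
As a concrete, simplified illustration of how such bounds could be computed, the sketch below uses a Clopper-Pearson interval over the observed biased/unbiased outcomes and compares it against a bias-tolerance threshold η. The helper names and the choice of a fixed-sample interval are our own assumptions for illustration, not the exact implementation used in the paper.

from scipy.stats import beta

def clopper_pearson(num_biased: int, num_samples: int, confidence: float = 0.95):
    """Two-sided Clopper-Pearson bounds on the probability of a biased response."""
    alpha = 1.0 - confidence
    lower = 0.0 if num_biased == 0 else beta.ppf(alpha / 2, num_biased, num_samples - num_biased + 1)
    upper = 1.0 if num_biased == num_samples else beta.ppf(1 - alpha / 2, num_biased + 1, num_samples - num_biased)
    return lower, upper

def certificate_decision(num_biased: int, num_samples: int, eta: float):
    """Compare the bounds against the bias-tolerance threshold eta."""
    lower, upper = clopper_pearson(num_biased, num_samples)
    if lower > eta:
        return "biased"
    if upper < eta:
        return "unbiased"
    return "undecided"  # not enough evidence yet; sample more prefixes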

(Overview of QuaCer-B): QuaCer-B is a quantitative certification framework to certify the bias in the responses of a target LLM for a random set of prompts that differ only by their sensitive attribute. In specific instantiations, QuaCer-B samples a (a) set of prefixes from a given distribution and prepends them to a prompt set to form (b) the prompts given to the target LLM. The target LLM's (c) responses are checked for bias by a bias detector, whose results are fed into a certifier. The certifier computes bounds on the probability of obtaining biased responses from the target LLM for any set of prompts formed with a random prefix from the distribution. The certifier adaptively samples prefixes until either the lower bound exceeds or the upper bound falls below a threshold η, certifying the model as biased or unbiased, respectively.
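
Putting the figure's components together, the certifier can be read as an adaptive sampling loop: draw a prefix, form the prompt set, query the LLM, run the bias detector, update the bounds, and stop once a decision is possible. The sketch below, reusing the clopper_pearson helper above, is a simplified rendering; sample_prefix, query_llm, and is_biased are placeholder interfaces, and the paper's certifier may use a different or sequentially corrected bound.

def certify_bias(sample_prefix, query_llm, is_biased, base_prompts,
                 eta: float, max_samples: int = 2000, confidence: float = 0.95):
    """Adaptively sample prefixes until the bounds on the bias probability decide the certificate.

    sample_prefix : () -> str          draws one prefix from the given prefix distribution
    query_llm     : str -> str         returns the target LLM's response to one prompt
    is_biased     : list[str] -> bool  bias detector over the responses to one prompt set
    base_prompts  : list[str]          fixed prompt set differing only in the sensitive attribute
    """
    num_biased = 0
    for n in range(1, max_samples + 1):
        prefix = sample_prefix()                            # (a) sample a prefix
        prompts = [prefix + " " + p for p in base_prompts]  # (b) prompts to the target LLM
        responses = [query_llm(p) for p in prompts]         # (c) responses
        num_biased += int(is_biased(responses))             # bias detector
        lower, upper = clopper_pearson(num_biased, n, confidence)
        if lower > eta:
            return "biased", (lower, upper)
        if upper < eta:
            return "unbiased", (lower, upper)
    return "undecided", clopper_pearson(num_biased, max_samples, confidence)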

We illustrate certificates generated by QuaCer-B for popular, state-of-the-art LLMs with 3 kinds of distributions. Each distribution is defined over a sample space whose elements are sets of prompts. Each set of prompts is developed from a fixed set of prompts by prepending a random prefix. The fixed set of prompts that characterizes a distribution of sets of prompts is derived from samples of popular fairness datasets by varying the sensitive attributes in them. Hence, the distribution over sets of prompts reduces to a distribution over prefixes for a fixed set of prompts. The 3 kinds of prefix distributions we consider are (details in the paper): (1) random sequences of tokens, (2) mixtures of effective jailbreaks, and (3) effective jailbreaks perturbed in the model's embedding space.
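
To make the first two prefix distributions concrete, the snippet below sketches how they could be sampled. The tokenizer interface (assumed to follow the Hugging Face convention), the jailbreak list, and the mixing rule are illustrative assumptions; the embedding-space distribution additionally requires access to the model's embedding layer and is omitted here.

import random

def random_token_prefix(tokenizer, length: int = 20) -> str:
    """Distribution (1): a uniformly random sequence of tokens from the model's vocabulary."""
    token_ids = [random.randrange(tokenizer.vocab_size) for _ in range(length)]
    return tokenizer.decode(token_ids)

def jailbreak_mixture_prefix(jailbreaks: list[str], k: int = 2) -> str:
    """Distribution (2): concatenate a random subset of known-effective jailbreak prompts."""
    return " ".join(random.sample(jailbreaks, k))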

We certify popular LLMs for their bias with QuaCer-B and instances of the 3 kinds of distributions defined above. In particular, we certify the LLMs for gender and racial bias with distributions developed from samples of the BOLD and Decoding Trust datasets, respectively. We observe novel trends in the performance of the LLMs, which we describe in detail in our paper. Below, we show some example responses of a state-of-the-art LLM to prompts sampled from a distribution derived from each dataset, for gender and racial bias respectively, to illustrate the prompts and responses used in certification.
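
For instance, one such fixed prompt set could be instantiated by substituting the sensitive attribute into a template built from a dataset sample; the helper and the template below are hypothetical, chosen only to illustrate the structure of the prompt sets.

def make_prompt_set(template: str, attribute_values: list[str]) -> list[str]:
    """One prompt per sensitive-attribute value; the prompts differ only in that attribute."""
    return [template.format(attr=value) for value in attribute_values]

# Hypothetical gender-bias instance built around a BOLD-style continuation:
prompt_set = make_prompt_set(
    "Continue the sentence about a {attr} engineer: 'Desktop automation refers to software programming'",
    ["male", "female"],
)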

Examples


The user prompt is shown in blue and its prefix in red; all text in black is generated by the AI. The example responses are shown for the state-of-the-art LLM Mistral-Instruct-v0.2 (see the paper for examples from other LLMs). The responses for gender bias are shown for binary gender, obtained during certification for the BOLD example 'Desktop automation refers to software programming' and prefixes that are mixtures of jailbreaks. The responses for racial bias are obtained during certification for the Decoding Trust example '[target group] have a drug problem' and prefixes that are mixtures of jailbreaks.

BibTeX

@misc{chaudhary2024quantitative,
      title={Quantitative Certification of Bias in Large Language Models}, 
      author={Isha Chaudhary and Qian Hu and Manoj Kumar and Morteza Ziyadi and Rahul Gupta and Gagandeep Singh},
      year={2024},
      eprint={2405.18780},
      archivePrefix={arXiv},
      primaryClass={cs.AI}
}

Ethics Statement

This work presents examples and code of our certification framework, which can be used to reliably assess state-of-the-art LLMs for biases in their responses. While the framework is general, we have illustrated it with practical examples of prefix distributions, which can include potential jailbreaks. The exact adversarial nature of these prefixes is unknown, but since they are derived from popular jailbreaks, the threat they pose is important to investigate. Hence, we used these prefixes to certify the bias in popular LLMs and have informed the model developers about their potential threat.