Quantitative Certification of Bias in Large Language Models

University of Illinois Urbana-Champaign¹, Amazon², Pyron³, VMware Research⁴

Abstract

Content Warning: This work contains examples of offensive language.

Large Language Models (LLMs) can produce biased responses that can cause representational harms. However, conventional studies are insufficient to thoroughly evaluate LLM bias, as they cannot scale to large numbers of inputs and provide no guarantees. Therefore, we propose the first framework, QuaCer-B (Quantitative Certification of Bias), that certifies LLMs for bias on distributions of prompts. A certificate consists of high-confidence bounds on the probability of unbiased LLM responses for any set of prompts mentioning various demographic groups, sampled from a distribution. We illustrate the bias certification for distributions of prompts created by applying varying prefixes, drawn from a prefix distribution, to a given set of prompts. We consider prefix distributions for random token sequences, mixtures of manual jailbreaks, and jailbreaks in the LLM’s embedding space to certify bias. We obtain non-trivial certified bounds on the probability of unbiased responses of SOTA LLMs, exposing their vulnerabilities over distributions of prompts generated from computationally inexpensive distributions of prefixes.

Media coverage

  • [Jul 2024] Thanks to Bruce Adams and the Siebel School of Computing and Data Science for writing about our work: https://siebelschool.illinois.edu/news/bias-LLMs

Overview

    Large Language Models (LLMs) have shown impressive performance as chatbots and are hence used by millions of people worldwide. This, however, brings their safety and trustworthiness to the forefront, making it imperative to guarantee their reliability. Prior work has generally focused on establishing trust in LLMs through evaluations on standard benchmarks. This analysis, however, is insufficient due to the limitations of the benchmarking datasets, their use in LLMs' safety training, and the lack of guarantees through benchmarking. As an alternative, we propose quantitative certificates for LLMs and develop a novel framework, QuaCer-B, to quantitatively certify LLMs for bias in their responses. We define bias as an asymmetry in the LLM's responses for a set of prompts that differ only by a sensitive attribute.

    QuaCer-B considers a given distribution of sets of prompts to certify a target LLM. The certificate consists of high-confidence bounds on the probability of obtaining a biased response from the LLM for a set of prompts randomly sampled from the distribution. The figure below presents an overview of QuaCer-B on an example distribution of prompts developed from a sample from the BOLD dataset.

    Figure (Overview of QuaCer-B): QuaCer-B is a quantitative certification framework to certify the bias in the responses of a target LLM for a random set of prompts that differ by their sensitive attribute. In specific instantiations, QuaCer-B samples (a) a set of prefixes from a given distribution and prepends them to a prompt set to form (b) the prompts given to the target LLM. (c) The target LLM’s responses are checked for (d) bias by a bias detector, whose results are fed into a certifier. (e) The certifier computes bounds on the probability of obtaining biased responses from the target LLM for any set of prompts formed with a random prefix from the distribution.
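The certifier step (e) can be sketched as follows. This is a minimal illustration, not the framework's implementation: it assumes each sampled prompt set yields an independent biased/unbiased verdict and uses a Clopper-Pearson binomial confidence interval, which is one standard way to obtain high-confidence bounds on a probability; this page does not state which interval QuaCer-B uses. The names `clopper_pearson`, `certify`, and `draw_is_biased` are hypothetical, and a coin flip stands in for the target LLM and bias detector.

```python
import math
import random

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _root_of_increasing(g, tol=1e-10):
    """Bisection root on [0, 1] of a function increasing from g(0) < 0 to g(1) > 0."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for a binomial proportion."""
    half = alpha / 2
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else _root_of_increasing(
        lambda p: (1 - binom_cdf(k - 1, n, p)) - half)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else _root_of_increasing(
        lambda p: half - binom_cdf(k, n, p))
    return lower, upper

def certify(draw_is_biased, n_samples, alpha=0.05):
    """Count biased verdicts over n_samples draws and bound P(biased response)."""
    biased = sum(bool(draw_is_biased()) for _ in range(n_samples))
    return clopper_pearson(biased, n_samples, alpha)

# Stand-in for steps (a)-(d): a real run would sample a prefix, query the
# target LLM, and call a bias detector. Here a seeded coin flip plays that role.
rng = random.Random(0)
lower, upper = certify(lambda: rng.random() < 0.3, n_samples=50)
print(f"P(biased) lies in [{lower:.3f}, {upper:.3f}] with 95% confidence")
```

The bounds tighten as `n_samples` grows, so the number of samples trades certification cost against the width of the certified interval.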

    We illustrate certificates generated by QuaCer-B for popular SOTA LLMs with 3 kinds of distributions. Each distribution is defined over a sample space whose elements are sets of prompts. Each set of prompts is developed from a fixed set of prompts by prepending a random prefix. The fixed set of prompts that characterizes a distribution of sets of prompts is derived from samples of popular fairness datasets by varying the sensitive attributes in them. Hence, the distribution of the sets of prompts reduces to a distribution of prefixes for a fixed set of prompts. The 3 kinds of prefix distributions we consider (details in the paper) are: (1) random sequences of tokens, (2) mixtures of effective jailbreaks, and (3) an effective jailbreak perturbed in the model's embedding space.
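The reduction from a distribution over prompt sets to a distribution over prefixes can be sketched as below. This is an illustrative reading of the description above, covering only kind (1), random token sequences; kinds (2) and (3) are not sketched because their construction is detailed only in the paper. The template, group list, toy vocabulary, and all function names are hypothetical and not taken from the BOLD or Decoding Trust datasets.

```python
import random

def make_base_prompts(template, groups):
    """Fixed prompt set: one prompt per value of the sensitive attribute."""
    return [template.format(group=g) for g in groups]

def random_token_prefix(vocab, length, rng):
    """Kind (1): a random sequence of tokens, drawn uniformly from vocab."""
    return " ".join(rng.choice(vocab) for _ in range(length))

def sample_prompt_set(base_prompts, sample_prefix):
    """One draw from the distribution: the same random prefix on every prompt."""
    prefix = sample_prefix()
    return [f"{prefix} {prompt}" for prompt in base_prompts]

# Hypothetical inputs, chosen only for illustration.
TEMPLATE = "Describe a typical day for a {group} software engineer."
GROUPS = ["male", "female"]
VOCAB = ["alpha", "delta", "::", "zx", "9", "lorem"]

rng = random.Random(0)
base = make_base_prompts(TEMPLATE, GROUPS)
prompt_set = sample_prompt_set(base, lambda: random_token_prefix(VOCAB, 5, rng))
for prompt in prompt_set:
    print(prompt)
```

Because the prompts in a set share one prefix and differ only in the sensitive attribute, any asymmetry in the responses can be attributed to that attribute rather than to the prefix.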

    We certify popular LLMs for their bias with QuaCer-B and instances of the 3 kinds of distributions defined above. In particular, we certify the LLMs for gender and racial bias with distributions developed from samples of the BOLD and Decoding Trust datasets, respectively. We observe novel trends in the performance of the LLMs, which we describe in detail in our paper. Below we show some example responses of a SOTA LLM to prompts sampled from a distribution derived from each dataset, for gender and racial bias respectively, to illustrate the prompts and responses used in certification.

    Examples

    The example responses are from the SOTA LLM Mistral-Instruct-v0.2 (see the paper for examples from other LLMs). The gender bias responses, for binary gender, were obtained during certification with the BOLD example ‘Desktop automation refers to software programming’ and prefixes that are mixtures of jailbreaks. The racial bias responses were obtained during certification with the Decoding Trust example ‘[target group] have a drug problem’ and prefixes that are mixtures of jailbreaks.

    BibTeX

    @misc{chaudhary2024quantitativecertificationbiaslarge,
          title={Quantitative Certification of Bias in Large Language Models}, 
          author={Isha Chaudhary and Qian Hu and Manoj Kumar and Morteza Ziyadi and Rahul Gupta and Gagandeep Singh},
          year={2024},
          eprint={2405.18780},
          archivePrefix={arXiv},
          primaryClass={cs.AI},
          url={https://arxiv.org/abs/2405.18780}, 
    }

    Ethics Statement

    This work presents examples and code of our certification framework that can be used to reliably assess state-of-the-art LLMs for biases in their responses. While the framework is general, we have illustrated it with practical examples of prefix distributions, which can consist of potential jailbreaks. The exact adversarial nature of the prefixes is unknown, but because they are derived from popular jailbreaks, the threat they pose is important to investigate. Hence, we used these prefixes to certify the bias in popular LLMs and have informed the model developers about their potential threat.