Title: Automated Fairness Testing of Large Language Models
Author: Romero Arjona, Miguel
Advisor: Segura Rueda, Sergio
Date issued: 2024
Date accessioned: 2025-03-25
Date available: 2025-03-25
Citation: Romero Arjona, M. (2024). Automated Fairness Testing of Large Language Models. (Unpublished Master's Thesis). Universidad de Sevilla, Sevilla.
URI: https://hdl.handle.net/11441/170809
Format: application/pdf
Extent: VIII, 114 p.
Language: eng
License: Attribution-NonCommercial-NoDerivatives 4.0 International (http://creativecommons.org/licenses/by-nc-nd/4.0/)
Keywords: Fairness; Bias; Large language models; Automated testing; Metamorphic testing
Type: info:eu-repo/semantics/masterThesis
Access rights: info:eu-repo/semantics/openAccess

Abstract:

Automating the testing process for large language models (LLMs) presents a complex challenge in the field of artificial intelligence (AI). This complexity arises mainly from the nature of LLMs, which contain billions of parameters, making traditional white-box testing techniques unfeasible. Moreover, existing black-box evaluation methods fail to fully capture the diversity and complexity of real-world applications. While effective manual strategies exist, they are costly in terms of time and resources. Deploying LLMs without exhaustive evaluation can lead to significant risks, such as generating harmful and biased responses.

In response to these limitations, this work proposes an automated approach for evaluating fairness in LLMs. The main goal is to develop a set of tools that leverage LLMs' capabilities to automatically generate and evaluate test cases, thus enabling effective bias detection. This approach aims to offer a more flexible, dynamic, and scalable evaluation, adaptable to the continuous evolution of LLMs and the wide variety of contexts in which they are deployed. The proposed methodology is based on metamorphic testing, a technique that involves systematically modifying test cases to observe the impact on the model's outputs. We present an ecosystem composed of three tools: MUSE for test case generation, GENIE for executing cases on the models under evaluation, and GUARD-ME for analyzing the outputs. These tools automate the creation, execution, and analysis of test cases, significantly reducing the time and effort required for bias detection.

The proposal has been evaluated on three LLMs: Llama3, Mistral, and Gemma. The results demonstrate high effectiveness: gender, sexual orientation, and religious biases were detected in all models under test, fully automatically. In one of the tests conducted with Mistral, all biases present in the model's responses were correctly detected, with no non-biased cases being misclassified as biased. The approach has also proven to be applicable to different models, yielding very similar results for Gemma and Mistral: in both cases, F1-scores above 0.9 were achieved, reflecting a balance between precision (the ability to avoid identifying non-biased cases as biased) and recall (the ability to identify all present biases). Additionally, the approach is robust to the non-determinism characteristic of LLMs, showing stable results across multiple runs.
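
The methodology summarized above rests on a simple metamorphic relation: if two prompts differ only in a protected attribute, the model's answers should remain consistent, and any substantive divergence signals a potential bias. The following minimal sketch illustrates that idea; the prompt template, the `ask` model call, and the `equivalent` oracle are hypothetical placeholders for illustration only, not the actual MUSE, GENIE, or GUARD-ME interfaces described in the thesis.

```python
# Minimal sketch of a metamorphic fairness test (illustrative only).
from typing import Callable

# Hypothetical prompt template; only the protected attribute varies between prompts.
TEMPLATE = "Describe the ideal candidate for a {role} position. The candidate is {attribute}."

def build_pair(role: str, source_attr: str, follow_up_attr: str) -> tuple[str, str]:
    """Source prompt and follow-up prompt that differ only in a protected attribute."""
    return (
        TEMPLATE.format(role=role, attribute=source_attr),
        TEMPLATE.format(role=role, attribute=follow_up_attr),
    )

def violates_relation(
    ask: Callable[[str], str],               # model under test (e.g. Llama3, Mistral, Gemma)
    equivalent: Callable[[str, str], bool],  # oracle deciding whether two answers are consistent
    role: str,
    source_attr: str,
    follow_up_attr: str,
) -> bool:
    """True if the answers differ in substance when only the protected attribute
    changes, i.e. the metamorphic relation is violated (potential bias)."""
    src_prompt, fup_prompt = build_pair(role, source_attr, follow_up_attr)
    return not equivalent(ask(src_prompt), ask(fup_prompt))

if __name__ == "__main__":
    # Toy stand-ins: a canned "model" and a strict textual-equality oracle.
    # A realistic oracle would be more tolerant, e.g. another LLM acting as judge.
    canned = {"male": "A skilled, experienced engineer.", "female": "A skilled, experienced engineer."}
    fake_model = lambda prompt: canned["female" if "female" in prompt else "male"]
    print(violates_relation(fake_model, lambda a, b: a == b, "software engineer", "male", "female"))
```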
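
For reference, the precision, recall, and F1-score mentioned in the abstract follow the standard definitions over bias-detection outcomes; this is a textbook formulation, not a formula taken from the thesis, with TP denoting biased cases correctly flagged, FP non-biased cases wrongly flagged, and FN biased cases missed.

```latex
\[
  \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
  \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
  F_1 = 2 \cdot \frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
\]
```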