Back to Articles

Testing AI Based Solutions

Testing AI Based Solutions
[
Blog
]
Table of contents
    TOC icon
    TOC icon up
    Ben Hall, Tech Director Lead | Martin Haurilak, Delivery Director | Peter Altosaar, Technical Lead
    Published:
    Who should read this:

    QA practitioners and managers interested in the evolving landscape of QA for solutions based on Large Language Models (LLMs).

    This blog will be a bit more technical than previous entries in our series, as we look at Quality Assurance for AI driven systems. We're not discussing how to test AI during model training, but how to test systems that include AI (specifically Large Language Models) as one of their components.

    First, we must understand that any AI model will provide output values that are probabilistic in nature. For a given input a series of outputs (generally between -1 and 1 or 0 and 1) are provided which either represent the categorization of the input or the probability that a given output is "the right one".

    When the problem domain is small it is easy to validate the output for each input and determine whether that output is satisfactory given the problem domain.  

    However, with LLMs the problem domain is extremely large (a large sample of human language), and the possible inputs and outputs are unconstrained. As a result, we need to adapt our approach to testing these models.

    Let’s start at the beginning. Whether a solution is AI-driven or not, the QA process should start early in the delivery lifecycle, at the same time as solution design or architecture. During this phase, the QA, Analysts, and architects need to collaborate on understanding what the AI can realistically do. For example, if 100% accuracy is required, but the AI can only provide 85%, this needs to be raised early.

    At this early stage, the team mainly has expertise and benchmarks to guide them. Several existing AI benchmarks can be used to help select the right AI model for a given situation. In the LLM space in particular, a vast number of benchmarks are available to draw from, which should help narrow down the right model for a given use case.

    Once you have integrated a model into your solution, the approach to testing should be driven by the risks you are trying to mitigate. As we've already covered in this blog series, an AI can provide great efficiency gains, but will on occasion be wrong, and a "trust but verify" approach is necessary in most cases. Your QA approach should account for the level of human-in-the-loop "verification" included in your use cases:

    • If the AI is used as an accelerator for a human expert who is very likely to catch any mistakes the AI makes, then your QA efforts regarding the LLM itself in the solution could be minimal.
    • If the AI is used to support a more junior staff member but is not client facing, the QA approach should cover a substantial number of the solution's scenarios involving the LLM to reduce the risk of failure or incorrect outcomes. This testing could however omit scenarios where staff consciously attempt to misguide the AI.
    • If the AI is directly exposed to the public, the QA approach should be extremely thorough, and account for hostile users who specifically try to misguide the AI.

    Now that we know "how much" testing we should account for, how do we go about testing the solution? Introducing AI doesn’t mean abandoning time-tested QA best practices, and the methodology doesn't fundamentally change.  Perform unit testing, then component testing, and finally integrated end-to-end testing. Your AI model will, in most cases, be only one piece of the puzzle.

    When it comes to testing your AI component, your QA approach should have two to four categories of validation:
    1. Classical human testing
    2. LLM component testing using LLM evaluation metrics
    3. Responsibility testing which validates responses for bias, fairness, etc.
    4. Abuse testing, which is analogous to security penetration testing
    Classical Human Testing or testing performed by people, typically QA Analysts.  

    When testing LLMs, QA resources should use two approaches:

    1. testing specific scenarios with expected outcomes
    2. open-ended exploratory testing where testers can freely interact with the AI and evaluate if its responses are appropriate

    Unlike traditional software tests with a single correct answer, LLM outputs can vary. These outputs can be evaluated with human judgement, since many results may be acceptable to varying degrees. Test plans should account for this flexibility.

    LLM Component Testing using LLM evaluation metrics. As mentioned earlier, the number of test permutations with LLMs is potentially limitless, so how can we apply automated verification? QA teams use evaluation tools that have been developed by the AI community, and help validate LLM responses for correctness, semantic similarity to the desired outcome, hallucination, and task specificity. These tools are typically referred to as "scorers" as they provide one or more scores (metrics) to report how well an LLM performed.  

    There are 2 broad categories of scorers:

    1. Statistical scorers use statistical analysis of the output to provide a score. These solutions are good for basic textual validation but perform poorly on evaluating the semantic and reasoning aspects of the models.
    2. Model based scorers which themselves use AI models to analyze the output of an LLM. These have proven to be more reliable in many cases.

    Many recent scoring techniques also use a combination of both approaches. For an in depth review of these metrics, which far surpasses the scope of this article, read this article by Jeffrey Ip from Confident AI.

    Responsibility testing and abuse testing. These tests may or may not be required depending on the context of the system being evaluated, as described earlier. The tools used to perform these tests are often the same as those used for LLM component testing. However, we call these tests out as they will exercise specific scenarios that are important when exposing the LLM to a customer or the public. They may either try to abuse the system or be offended by the responses of the system, or in some cases both.

    Responsibility testing checks whether an AI generates biased or toxic content. This is often done using specialized LLMs that can evaluate the outputs and decide if potentially offensive words are harmful in context, or harmlessly metaphorical or humorous.

    Abuse testing is the AI equivalent of penetration testing and should be part of your cybersecurity team’s testing toolkit. These tests aim to prevent scenarios where a user tries to “jailbreak” the model, tricking it into responding outside its intended domain. Such behavior can allow the user to bypass security measures or exploit the system for unintended purposes. A recent example of this is the “Skeleton Key” technique, which used a multi-step approach to manipulate the model into ignoring its guardrails.

    To wrap up, it isn't time yet to throw out our existing QA practices, but when it comes to validating the AI components of your solutions some new techniques are in order.

    As always, do not hesitate to let us know what you think of this content, or if you have any ideas for our future blog posts, drop us a comment or contact us at media@electricmind.com.

    Got a complex challenge?
    Let’s solve it – together, and for real
    Frequently Asked Questions