The Need for Adjustments in Robustness Testing for Complex Systems Such as AI

Traditional reliability engineering concepts are easy to apply to hardware because the physical properties and relationships of hardware components are generally well understood. Reliability engineers can confidently test industrial blenders, robot arms, and most home appliances, and established methods allow them to estimate and calculate the reliability of systems built from hardware components.

However, the foundations of traditional reliability engineering concepts crumble when applied to complex or unexplainable systems such as AI. If safety-critical AI were tested and validated in exactly the same way that reliability engineers test a toaster, engineers would need to test a nearly infinite number of combinations of environmental factors that could affect the output of the AI.

So, how should someone quantify the reliability of AI when failure mechanisms are not transparent, and failure behavior is not explainable?

What is Reliability Engineering?
Imagine you are the proud owner of a new toaster. To celebrate the arrival of your new kitchen gadget, you decide to make toast. Two slices of bread are inserted into the toaster, and the lever is pressed down. Then, nothing happens. You’ve just encountered what reliability engineering considers a safe failure.

So, you return the broken toaster and get a replacement. This time it seems to work just fine and you turn your attention elsewhere while your bread cooks. After several minutes, you smell something burning. The new toaster didn’t release the bread in time and now there’s a small fire in your kitchen. You’ve just encountered what reliability engineering considers to be a dangerous failure.

Experiencing two back-to-back failures is enough to scare you away from this brand of toaster. A toaster that doesn’t work is frustrating, but a toaster that can start a fire is unacceptable, especially when it is used as intended (within its operational scope).


Reliability engineering helps determine how long a device can safely be used and the environments (operational boundaries) in which it can be used. This practice uses data about failures from real-world scenarios, math, and statistics (probability distributions) to produce evidence of safety under specific conditions. This evidence is not perfect, but it is the reason that you can generally trust that your car will stop when you hit the brakes or that your toaster won’t catch your house on fire. However, using a “reliable” device for purposes other than its intended use can result in unpredictable and often dangerous outcomes.

Traditional Reliability Engineering Concepts
To collect reliability data from hardware components, engineers continuously test a sample of components until failure occurs. The time of failure is recorded and plotted on a probability distribution, a mathematical model used to estimate the likelihood that a component will fail at a given time. Using this data, the reliability of a hardware component can be determined. If the type of failure is recorded alongside the time, the probability of specific failures can also be quantified.
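
To make this concrete, the sketch below fits a Weibull distribution, a common model for hardware failure times, to a small set of hypothetical time-to-failure measurements and estimates the probability that a component survives a chosen service interval. The failure times, the choice of distribution, and the 1,000-hour interval are illustrative assumptions, not data from a real test program.

```python
# A minimal sketch of fitting failure-time data to a probability distribution.
# The failure times below are hypothetical; real data would come from life testing.
from scipy import stats

# Hours until failure observed for a sample of components under continuous test
failure_hours = [812, 947, 1103, 1220, 1340, 1415, 1560, 1684, 1790, 1932]

# Fit a two-parameter Weibull distribution (location fixed at zero),
# a common model for hardware wear-out behavior
shape, loc, scale = stats.weibull_min.fit(failure_hours, floc=0)

# Reliability R(t): the probability that a component is still working at time t
t = 1000  # hours, an arbitrary service interval for illustration
reliability = stats.weibull_min.sf(t, shape, loc=loc, scale=scale)

print(f"Weibull shape = {shape:.2f}, scale = {scale:.0f} hours")
print(f"Estimated probability of surviving {t} hours: {reliability:.2%}")
```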

Safety engineering is primarily interested in dangerous failures or events that can lead to dangerous failures (such as chain reactions). To reduce the risk of missing critical information, these failures are analyzed further. Many methods and techniques can be used to examine failures in depth, including Reliability Block Diagrams (RBDs), Failure Mode and Effects Analyses (FMEAs), and Fault Tree Analyses (FTAs). It’s not necessary to understand all these methods in depth for this article, but it is good to know they exist.

Traditional reliability engineering is effective when four key assumptions remain true:

  1. Failures can be identified and documented without direct observation or prior experience
  2. The causes of failures are clearly traceable
  3. Failure behavior is explainable
  4. A smaller test sample can represent the larger population

These assumptions establish the limits of our understanding. When one of these is not met, confidence about what is happening in the larger population is called into question.

For example, imagine testing a new AI-enabled toaster that automatically adjusts temperature and cooking time. Several units catch fire during testing, yet all mechanical components are working as intended. When you try to replicate the failures, the toasters perform normally. In situations like this, identifying the cause and recommending a reliable fix becomes significantly more difficult because you cannot troubleshoot your way to the exact cause of failure.

Would you be able to recommend a solution that prevents the toasters from catching fire without removing AI from the equation? Did your solution have the toaster coming pre-packaged with a fire suppression system? Are you sure that fire is your only concern? Although it’s a silly example, it illustrates how difficult addressing these types of failures can be.

Complex System Adjustments
For complex systems such as AI, these assumptions often do not apply. Most AI systems are based on machine learning (ML) and are trained to produce the desired outputs. Although there are methods to verify the effectiveness of the training and the accuracy of the AI’s response, the reasoning behind the AI’s output is generally not easily understood by humans.

These challenges aren’t limited to toasters. Many existing complex systems face similar issues. Balancing bipedal robots is one well-known example. Humans would find it extremely difficult to describe or predict the minute movements required for a robot to maintain balance while it slips on liquid. The many small uncertainties across interconnected parts mean that the balancing behavior cannot be fully predicted in advance.

So, if the baseline assumptions of traditional reliability engineering no longer hold true, how can reliability be determined for complex and unexplainable systems? To obtain meaningful reliability values, additional steps and modifications to traditional reliability engineering principles are necessary.

Engineers working with such systems should:

  1. Treat the system as unsafe until proven otherwise
  2. Avoid presuming that explainability is attainable
  3. Emphasize incremental testing before comprehensive failure analysis
  4. Focus on discovering the limits of conditions known to cause failure without testing the entire universe
  5. Enable speedy iterative testing and analysis

Integration of Adjustments
Assume Danger First
With traditional reliability engineering, components are assumed to be capable of performing their function within defined limits. During the design phase, engineers determine how an elevator will be lifted, for example by selecting the type and thickness of rope needed to safely carry the intended load. The accumulated design work establishes the operational limits for the final product.

For complex and unexplainable systems, a more conservative stance should be taken, in alignment with the scientific method. Unlike pushbuttons, screws, or switches, these systems do not have a long history of proven operational use. As a result, they should be assumed unsafe until proven otherwise. System testing should intentionally attempt to provoke unsafe behavior to inform the boundaries of safe operation.
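
As a loose illustration of this posture, the sketch below actively searches for operating conditions that provoke unsafe behavior rather than confirming expected behavior. The run_system_under_test function, the condition ranges, and the failure region inside the toy stand-in are all invented so the sketch runs end to end; a real program would connect the loop to an actual test rig or simulator.

```python
# A minimal falsification-style sketch: try to provoke unsafe behavior
# instead of confirming that the system behaves as expected.
import random

def run_system_under_test(ambient_temp_c, vibration_g, humidity_pct):
    """Toy stand-in for a real test harness: returns True if behavior was safe.
    The failure region below is invented purely so the sketch is runnable."""
    return not (vibration_g > 1.5 and ambient_temp_c > 40)

unsafe_conditions = []
for _ in range(500):  # the number of trials is arbitrary for illustration
    condition = {
        "ambient_temp_c": random.uniform(-10, 50),
        "vibration_g": random.uniform(0.0, 2.0),
        "humidity_pct": random.uniform(10, 95),
    }
    if not run_system_under_test(**condition):
        unsafe_conditions.append(condition)

# Any hits become evidence used to narrow the claimed safe operating boundaries
print(f"Provoked unsafe behavior in {len(unsafe_conditions)} of 500 trials")
```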

Assume a Black Box
Achieving compliance with established practices can create a false sense of confidence when those practices are applied to complex or unexplainable systems. Small changes in parameters or inputs can result in large and unexpected operational changes. Treating these systems as black boxes serves as a reminder that existing reliability, safety, and quality assurance techniques may not directly apply.

Test First – Analyze Later
Prioritizing testing before detailed failure analysis does not imply that failures cannot be predicted. Failures of complex systems can be predicted in many cases. However, a process that defines failure modes first and then validates them through testing is likely to overlook failures that arise from interactions within the black box. While it might be impossible to understand why a system fails under conditions such as light vibration combined with a bright LED light overhead, it is still possible to quantify and establish that operational boundary.
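
One way to establish such a boundary empirically is to bracket it: hold other conditions fixed and narrow in on the stress level at which failures begin. The sketch below bisects a single vibration factor through a hypothetical passes_test hook; the toy failure threshold and the tolerance are invented, and a real program would repeat each level many times, since failures of complex systems are rarely this clean or monotonic.

```python
# A minimal sketch of locating an operational boundary empirically,
# without needing to explain why the system fails beyond it.

def passes_test(vibration_amplitude_g):
    """Toy stand-in for the real test setup: pretend failures begin above ~0.8 g."""
    return vibration_amplitude_g < 0.8

def find_failure_threshold(low, high, tolerance=0.01):
    """Bisect between a known-good level (low) and a known-bad level (high)."""
    while high - low > tolerance:
        mid = (low + high) / 2
        if passes_test(mid):
            low = mid   # still safe, move the lower bound up
        else:
            high = mid  # failed, move the upper bound down
    return low  # highest level demonstrated to pass within the tolerance

threshold = find_failure_threshold(low=0.0, high=2.0)
print(f"Observed safe vibration limit: roughly {threshold:.2f} g")
```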

Limit Scope
Identifying the boundaries of complex or unexplainable systems without testing the entire universe requires constraining the domain in which the system is used, compiling known failure-inducing conditions, and applying design-of-experiment techniques. For further technical discussion of this topic, including information on Response Surface Methodology (RSM), see “Why Five 9s is not Five Stars – The Need for Out-of-Distribution Robustness Testing in AI Functional Safety.”
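
As a simple illustration of constraining the domain and applying a design-of-experiments approach, the sketch below enumerates a small full-factorial design over three factors limited to the system’s intended operating range. The factor names and levels are hypothetical; a real study would draw them from the operational design domain and from known failure-inducing conditions, and would likely move to fractional factorial or response-surface designs as the number of factors grows.

```python
# A minimal full-factorial design sketch over a deliberately constrained domain.
# Factors and levels are hypothetical examples, not recommendations.
from itertools import product

# Only levels inside the intended operating range are included,
# along with levels near conditions already known to induce failures.
factors = {
    "ambient_temp_c": [5, 25, 45],
    "supply_voltage_v": [110, 120, 130],
    "bread_moisture_pct": [20, 35, 50],
}

# Every combination of the chosen levels: 3 x 3 x 3 = 27 test runs
design = [dict(zip(factors, levels)) for levels in product(*factors.values())]

for run_id, condition in enumerate(design, start=1):
    print(f"Run {run_id:02d}: {condition}")
    # execute_test(condition)  # hypothetical hook into the actual test procedure
```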

Iterate Fast
Ensuring the safety and reliability of complex and unexplainable systems requires additional time, effort, and cost. Reducing the time between testing and analysis is important for keeping the overall schedule reasonable. Taking years between releases can cripple innovation. Without an iterative test-and-analysis approach, even a minor change, such as replacing a simple sensor, could result in years of work.

Conclusion
Using complex or unexplainable components such as AI in safety-related systems may soon be possible. However, their black box nature requires special consideration. Using unmodified traditional reliability engineering concepts can result in dangerous and unpredictable outcomes. Therefore, adoption of falsification-oriented test strategies, black box treatment, and systematic boundary identification helps ensure that reliability metrics are meaningful for these systems.

As AI and other complex technologies become more integrated into daily life and take on more safety-critical responsibilities, users must be able to trust them with the same confidence they place in familiar household appliances. Organizations integrating AI into safety-related systems should evaluate whether their current reliability validation approaches adequately address black box behavior and out-of-distribution conditions. Reynolds & Moore supports companies preparing for this transition by assisting with robustness strategy development, operational design domain definition, and adaptation of traditional safety practices for complex systems.

References

  1. E. Reynolds, A. Gautam, and F. Ziari, “Why Five 9s is not Five Stars – The Need for Out-of-Distribution Robustness Testing in AI Functional Safety,” Reynolds & Moore, LLC, Dallas, TX, U.S.A., Sep. 2025. Accessed: Jan. 05, 2026. [Online]. Available: https://reynolds-moore.com/

Author

Jordan Punch
Senior Functional Safety Engineer, Reynolds & Moore
Vila Nova de Gaia, Portugal
jordan.punch@reynolds-moore.com