Better Performance Means It’s Safer…Right?

The Performance / Safety Language Barrier
AI performance metrics can be misleading. Terms like accuracy and precision mean different things depending on the discipline of the audience, and in a discussion spanning AI development and functional safety, the same term can carry two entirely different meanings. Functional safety applications can draw on elements of these metrics, but only when the analysis accounts for dangerous failures. Without that distinction, the numbers carry no meaningful safety information.

Performance isn’t safety.

Use Case
AI has demonstrated real potential in safety-critical systems across a wide range of applications. It can support hazard identification from situational context, enable a drone to automatically traverse an unstructured environment where traditional safety systems fall short, or flag completeness gaps in critical safety documentation such as a Failure Modes and Effects Analysis (FMEA).

However, when AI is incorporated into a safety-related system, demonstrating its reliability and safety in that context is often a requirement, though the specifics depend on the application. Terms like accuracy, recall, and precision begin to surface. A project team may encounter questions similar to the following:

“We have a 99.5% accurate AI model; can we use this as evidence of achieved safety?”

Answering that question requires a closer examination of what it means to have an ‘accurate’ AI model.

Disclaimer: For the purposes of this article, AI models are treated as black boxes, and the focus is on the inputs, the outputs, and the environment in which a specific AI model will be deployed. This is not always the appropriate framing, but the core arguments presented here hold regardless.

Contextualizing Functional Safety and Performance Metrics Boundaries
Functional safety must understand the AI model’s function in the safety system and where the AI model will be performing that function. Essentially, functional safety does not care if the AI model can accurately determine which shells are the most beautiful on a beach.

Functional safety cares about the automatic protective functions that are integrated with the AI model to reduce the risk of injury, evaluated with the entire system in its deployed environment.

For any portion of a performance metric such as accuracy to be useful for functional safety, the metric must be limited to the safety-critical functionality being assessed. Combining safety-related and non-safety-related functions within the same performance metric undermines the analysis. The figures below illustrate the distinction: the leftmost image shows an inappropriate mix of safety and non-safety functions, while the rightmost shows safety metrics properly isolated.
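A minimal sketch of that scoping follows, assuming a hypothetical evaluation log in which every sample is tagged with the function it exercises. The Sample record, the function labels, and the counts are illustrative assumptions, not data from a real system.

```python
from collections import namedtuple

# Hypothetical evaluation record: the function exercised by the sample and
# whether the model's classification was correct.
Sample = namedtuple("Sample", ["function", "correct"])


def accuracy(samples):
    """Return the proportion of correct classifications in `samples`."""
    samples = list(samples)
    if not samples:
        raise ValueError("no samples to evaluate")
    return sum(s.correct for s in samples) / len(samples)


# Illustrative log: 900 non-safety samples (all correct) and
# 100 safety-critical landing-zone samples (90 correct, 10 incorrect).
log = (
    [Sample("non_safety", True)] * 900
    + [Sample("landing_zone", True)] * 90
    + [Sample("landing_zone", False)] * 10
)

mixed = accuracy(log)
safety_only = accuracy(s for s in log if s.function == "landing_zone")

print(f"mixed accuracy:       {mixed:.3f}")
print(f"safety-only accuracy: {safety_only:.3f}")
```

With these made-up counts, the mixed figure looks noticeably better than the safety-only figure, because the non-safety samples dilute the errors made by the safety-critical function.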

Beneath the AI Performance Metrics
“We have a 99.5% accurate AI model; can we use this as evidence of achieved safety?”

For the purposes of this analysis, assume that the 99.5% accuracy value has been scoped to the safety-critical functionality under assessment, and that it represents the lowest measured value across the entire intended operating environment.
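One way to read the second assumption is that accuracy is measured separately for each operating condition and the minimum is the value reported. The sketch below illustrates that reading; the condition names and counts are hypothetical.

```python
# Hypothetical per-condition results as (correct, total) pairs.
results = {
    "daylight": (1995, 2000),
    "dusk": (1992, 2000),
    "rain": (1990, 2000),
}

# Accuracy per operating condition.
per_condition = {name: c / t for name, (c, t) in results.items()}

# The reported value is the worst condition, not the pooled average.
worst_condition = min(per_condition, key=per_condition.get)
worst_accuracy = per_condition[worst_condition]

pooled = sum(c for c, _ in results.values()) / sum(t for _, t in results.values())

print(f"worst condition: {worst_condition} ({worst_accuracy:.4f})")
print(f"pooled accuracy: {pooled:.4f}")
```

Reporting the pooled value instead would hide the weakest operating condition behind the stronger ones.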

With those assumptions established, consider the definition for accuracy as it relates to an AI classification performance metric.

Accuracy: The proportion of all classifications that are correct (positive or negative), mathematically represented as:

\[
\text{Accuracy} = \frac{\text{Correct Classifications}}{\text{Total Classifications}}
\]

Returning to the drone example, if the accuracy metric reflects the drone’s ability to correctly identify a safe landing zone, a 99.5% value indicates that the drone makes the correct classification 99.5% of the time. The question is whether that value can be used to calculate the dangerous failure rate and, by extension, the reliability of the AI model. It cannot, and the reason is as follows.

Accuracy does not distinguish between safe and dangerous failures. If the drone incorrectly classifies a safe landing zone as dangerous, no harm results from that misclassification. However, that incorrect classification and one that directs the drone toward a hazardous zone are treated identically within the accuracy metric. Safe and dangerous misclassification failures are combined without differentiation.
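The point can be made concrete with two hypothetical landing-zone classifiers that score the same accuracy but fail in opposite directions. The sketch below is illustrative only: the summarize helper and all counts are assumptions chosen to make the arithmetic easy to follow.

```python
def summarize(true_safe_pred_safe, true_safe_pred_hazard,
              true_hazard_pred_safe, true_hazard_pred_hazard):
    """Summarize a 2x2 confusion matrix for landing-zone classification.

    A safe zone classified as hazardous is a safe failure (the drone
    declines a usable zone). A hazardous zone classified as safe is a
    dangerous failure (the drone is directed toward a hazard).
    """
    total = (true_safe_pred_safe + true_safe_pred_hazard
             + true_hazard_pred_safe + true_hazard_pred_hazard)
    correct = true_safe_pred_safe + true_hazard_pred_hazard
    return {
        "accuracy": correct / total,
        "safe_errors": true_safe_pred_hazard,
        "dangerous_errors": true_hazard_pred_safe,
    }


# Hypothetical model A: all 50 errors reject a zone that was actually safe.
model_a = summarize(9000, 50, 0, 950)

# Hypothetical model B: all 50 errors accept a zone that was actually hazardous.
model_b = summarize(9050, 0, 50, 900)

for name, result in (("A", model_a), ("B", model_b)):
    print(name, result)
```

Both models report 99.5% accuracy, yet only model B ever sends the drone toward a hazardous zone. The accuracy value alone cannot tell them apart.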

Additionally, accuracy is evaluated at the model level only. It does not account for diagnostics implemented by connected system elements that are designed to detect incorrect classifications before they result in harm. Diagnostics can significantly increase the detection rate of dangerous failures, but the accuracy metric does not reflect them.

Herein lies the core problem. Accuracy, diagnostic coverage (DC), and the safe failure fraction (SFF) are deceptively similar in structure. Equating incorrect classifications with dangerous failures, or correct classifications with safe failures, is a surprisingly easy error to make, particularly for those without a functional safety background.

Beneath the Safety-Relevant Metrics
Following the guidance set out in IEC 61508 requires going one level deeper.

DC is the ratio of detected dangerous failures to the total number of dangerous failures, mathematically represented as:

\[
\text{DC} = \frac{\text{Detected Dangerous Failures}}{\text{Dangerous Failures}}
\]

The SFF is the ratio of safe failures plus dangerous detected failures to all failures, represented mathematically as:

\[
\text{SFF} = \frac{\text{Safe Failures} + \text{Dangerous Detected Failures}}{\text{Safe Failures} + \text{Dangerous Detected Failures} + \text{Dangerous Undetected Failures}}
\]
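As a rough sketch, the DC and SFF definitions above translate into two small functions. The function names and the example values are assumptions for illustration; the inputs may be failure counts or failure rates, provided all three use the same units.

```python
def diagnostic_coverage(dangerous_detected, dangerous_undetected):
    """DC: detected dangerous failures over all dangerous failures."""
    dangerous_total = dangerous_detected + dangerous_undetected
    if dangerous_total <= 0:
        raise ValueError("DC is undefined without dangerous failures")
    return dangerous_detected / dangerous_total


def safe_failure_fraction(safe, dangerous_detected, dangerous_undetected):
    """SFF: safe plus dangerous detected failures over all failures."""
    total = safe + dangerous_detected + dangerous_undetected
    if total <= 0:
        raise ValueError("SFF is undefined without failures")
    return (safe + dangerous_detected) / total


# Hypothetical failure breakdown for one subsystem.
safe, dd, du = 60, 36, 4

dc = diagnostic_coverage(dd, du)
sff = safe_failure_fraction(safe, dd, du)

print(f"DC  = {dc:.2%}")
print(f"SFF = {sff:.2%}")
```

Neither function takes a "correct classification" count as an input. Every term requires that a failure has first been designated safe, dangerous detected, or dangerous undetected.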

Among the equations in IEC 61508, DC and SFF are the most likely candidates for inappropriate value substitution. Specifically, there is a temptation to equate “incorrect classification” with “dangerous failures” or “dangerous undetected failures.”

A further source of confusion is that SFF, DC, and accuracy are all expressed as percentage-based values. IEC 61508 defines three benchmark levels for SFF and DC: Low (60%), Medium (90%), and High (99%). It should also be noted that SFF will always be equal to or greater than the diagnostic coverage for the same element or subsystem.
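The relationship between SFF and DC follows directly from the two definitions. As a short derivation, write $S$, $DD$, and $DU$ for the safe, dangerous detected, and dangerous undetected failures (shorthand introduced here for brevity), and assume all three are non-negative with $DD + DU > 0$:

```latex
\[
\text{SFF} - \text{DC}
  = \frac{S + DD}{S + DD + DU} - \frac{DD}{DD + DU}
  = \frac{(S + DD)(DD + DU) - DD\,(S + DD + DU)}{(S + DD + DU)(DD + DU)}
  = \frac{S \cdot DU}{(S + DD + DU)(DD + DU)} \;\geq\; 0
\]
```

The two values are equal only when there are no safe failures ($S = 0$) or no dangerous undetected failures ($DU = 0$).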

For most applications where the AI model is not permitted to continue learning after deployment, functional safety metrics are expected to fall within the 90% to 99% range. It is not uncommon for AI systems in specific applications to achieve accuracy, precision, and recall values approaching 99%. When that occurs, the accuracy of an AI model may exceed the diagnostic coverage of the subsystem to which it belongs. Substituting the higher accuracy value in place of DC would produce, on paper, an inflated level of achieved safety that does not reflect reality.
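The size of that inflation can be estimated with a small calculation. By the definition of DC, the dangerous undetected share of a dangerous failure rate is that rate multiplied by (1 - DC). The sketch below compares the result for an actual DC with the result when an accuracy value is wrongly substituted; the failure rate, the DC value, and the function name are hypothetical.

```python
def dangerous_undetected_rate(dangerous_rate, coverage):
    """Dangerous undetected rate implied by a dangerous rate and a DC."""
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0 and 1")
    return dangerous_rate * (1.0 - coverage)


LAMBDA_D = 1e-6         # hypothetical dangerous failure rate, per hour
ACTUAL_DC = 0.90        # hypothetical diagnostic coverage of the subsystem
MODEL_ACCURACY = 0.995  # accuracy value wrongly used in place of DC

actual = dangerous_undetected_rate(LAMBDA_D, ACTUAL_DC)
claimed = dangerous_undetected_rate(LAMBDA_D, MODEL_ACCURACY)
understatement = actual / claimed

print(f"actual  dangerous undetected rate: {actual:.2e} per hour")
print(f"claimed dangerous undetected rate: {claimed:.2e} per hour")
print(f"understated by a factor of about {understatement:.0f}")
```

Under these assumed numbers, the substitution understates the dangerous undetected failure rate by roughly a factor of 20.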

As illustrated by the drone landing example, these variables are not interchangeable.

An additional issue with the accuracy metric concerns the classification criteria themselves. What criteria does the development team use to decide that a classification is correct? Landing a drone in a field may be acceptable in many cases, but a helicopter landing pad introduces hazards that classification by itself cannot capture. Depending on how the accuracy metrics and classification criteria were specified, it is reasonable to expect that the AI model would determine a grass-like helicopter landing pad to be a safe landing zone.

Functional safety should provide input into AI performance metrics from the outset. When a development team states that an AI model is 99.5% accurate, the classification criteria may reflect only a few of the many relevant environmental factors, such as whether humans are present in the landing zone. Determining human presence in a landing zone is important, but it is not the only criterion that needs to be evaluated from functional safety’s point of view.

Takeaways and Conclusion
Can an AI performance metric such as accuracy contribute anything to the safety case? The accuracy metric itself is not sufficient as a safety argument, but the incorrect classifications that result in dangerous failures can feed directly into failure analysis documentation and support the development of diagnostics.
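As an illustration of that hand-off, the sketch below filters evaluation records down to the dangerous misclassifications and groups them by scene type, producing a starting list for failure analysis. The record layout, scene tags, and data are hypothetical.

```python
from collections import Counter

# Hypothetical evaluation records: (ground truth, prediction, scene tag).
records = [
    ("hazardous", "safe", "helipad"),
    ("safe", "hazardous", "field"),
    ("hazardous", "safe", "people_present"),
    ("hazardous", "hazardous", "water"),
    ("hazardous", "safe", "helipad"),
    ("safe", "safe", "field"),
]


def dangerous_misclassifications(records):
    """Keep only hazardous zones that the model classified as safe."""
    return [r for r in records if r[0] == "hazardous" and r[1] == "safe"]


dangerous = dangerous_misclassifications(records)

# Group by scene so each recurring pattern can be reviewed as a
# candidate failure mode and a candidate target for diagnostics.
by_scene = Counter(scene for _, _, scene in dangerous)

for scene, count in by_scene.most_common():
    print(f"{scene}: {count} dangerous misclassification(s)")
```

The safe misclassification (a field rejected as hazardous) drops out of the list, while the recurring helipad case surfaces as a pattern that a diagnostic could be designed to catch.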

AI performance metrics all share the same limitation: they do not designate failures as dangerous or safe. Without the appropriate context, elements of these metrics can migrate into calculations for DC and SFF where they do not belong, and the gap may not be identified until meaningful damage has been done to the safety case.

AI performance metrics carry the risk of creating a false sense of confidence. A statement like “the AI model can classify landing zones with 99.5% accuracy” produces an initial impression of safety assurance. When examined through the lens of functional safety, however, that confidence is unlikely to hold. Bridging the gap between AI performance and functional safety requires early collaboration between disciplines, clear scoping of performance metrics, and a shared understanding of what dangerous failure actually means in a given system.

Minding the Gap Between Safety and Performance
Integrating AI into safety-critical systems is a multidisciplinary challenge. The performance metrics that AI development teams rely on are not inherently aligned with the failure analysis frameworks that functional safety requires. Recognizing where those frameworks diverge, and establishing the collaboration needed to reconcile them, is essential to building a credible safety case.

Navigating that gap is central to the work at Reynolds & Moore, which spans the full development lifecycle, from early-stage concept and planning through final safety documentation, with a consistent focus on keeping safety and innovation in productive tension. Teams working through these challenges are encouraged to reach out.

Author

Jordan Punch
Senior Functional Safety Engineer, Reynolds & Moore
Vila Nova de Gaia, Portugal
jordan.punch@reynolds-moore.com