How Do You Write Good Requirements for AI Components in a Safety-Critical System?

“Our product has a neural network in the perception stack. It detects objects and feeds downstream planning logic. We’ve validated it on our test dataset and the numbers look good — but our safety team keeps asking us to write ‘proper requirements’ for it, and we don’t know how to specify something we cannot fully specify. Where do we start?”

— Systems engineer, autonomous vehicle subsystem team

This question comes up constantly in teams shipping products at the intersection of machine learning and functional safety. The discomfort is legitimate. Traditional requirements engineering assumes you can write a deterministic, complete specification: given input X, the system shall produce output Y. Neural networks do not work this way. Their behavior is learned, not programmed. Their failure modes are statistical, not enumerable. Their edge cases are — almost by definition — the cases you didn’t think of.

That is a real constraint. It does not mean requirements cannot be written. It means the requirements must be written differently.


The Core Problem: Prescriptive Requirements Don’t Apply to Learned Models

Prescriptive requirements tell a component how to behave at the implementation level. For a classical algorithm, this is reasonable: you can specify the filter coefficients, the lookup table, the state machine transitions. For a neural network, prescriptive specification breaks down immediately. You cannot write requirements on weights. You cannot specify the internal representation. The model is defined by its training process, not by an authored design.

This does not mean the component is unspecifiable. It means you shift from prescriptive requirements to performance-based requirements: requirements that define what the system must achieve under defined conditions, without dictating how the model achieves it. This shift is already standard practice in other domains where implementations are not fully deterministic: electromagnetic compatibility, thermal performance, human factors. The approach works, and the engineering discipline for applying it to ML components is now mature enough to use in practice.

The failure mode most teams encounter is writing performance requirements that are too coarse, too globally aggregated, or too disconnected from their operational domain to support a safety argument. “The object detector shall achieve 94% mean average precision” is not a safety requirement. It is a benchmark result. A safety requirement bounds behavior in the conditions that matter for safety — and specifies what happens when behavior degrades.


Specifying Minimum Performance Boundaries

The starting point for any ML component in a safety-critical system is defining the Operational Design Domain (ODD): the set of conditions within which the component is expected to operate safely. For a perception stack, this includes environmental conditions (lighting, weather, sensor noise levels), object classes and size ranges, scene densities, geographic constraints, and sensor mounting configurations.
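One way to make the ODD concrete is to record each partition as structured data rather than prose. The following is a minimal sketch, not a prescribed schema: the field names, the weather labels, and the numeric bounds for the two hypothetical partitions labelled P-1 and P-2 are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Conditions:
    """Operating conditions observed for one frame (hypothetical fields)."""
    weather: str
    daylight: bool
    visibility_m: float


@dataclass(frozen=True)
class OddPartition:
    """One discrete, testable subset of the ODD."""
    partition_id: str
    weather: FrozenSet[str]
    daylight_required: bool
    min_visibility_m: float

    def contains(self, c: Conditions) -> bool:
        return (
            c.weather in self.weather
            and (c.daylight or not self.daylight_required)
            and c.visibility_m > self.min_visibility_m
        )


# Illustrative values only; real bounds come from the project's ODD definition.
PARTITIONS = (
    OddPartition("P-1", frozenset({"clear"}), True, 100.0),
    OddPartition("P-2", frozenset({"overcast", "light_rain"}), False, 50.0),
)


def classify(c: Conditions) -> Optional[str]:
    """Return the first matching partition id, or None if outside the ODD."""
    for partition in PARTITIONS:
        if partition.contains(c):
            return partition.partition_id
    return None
```

With partitions expressed this way, every test frame can be tagged with a partition id, and a frame that returns None is by construction outside the ODD.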

Once the ODD is defined, performance requirements must be written at the ODD-partition level, not as global averages.

A weak requirement:

The object detection module shall achieve a minimum mAP of 92% on the validation dataset.

A stronger set of requirements:

The object detection module shall achieve a minimum pedestrian detection recall of 97% at a false positive rate not exceeding 0.1 per frame, under daylight conditions with visibility greater than 100m, for objects between 1.5m and 2.0m height at distances between 5m and 80m, as defined in ODD partition P-1.

The object detection module shall achieve a minimum pedestrian detection recall of 89% at a false positive rate not exceeding 0.15 per frame, under overcast or light rain conditions as defined in ODD partition P-2.

For all ODD partitions, confidence score calibration shall satisfy a maximum expected calibration error (ECE) of 0.05, measured on the held-out calibration set for each partition.

The difference is operational specificity. You are not measuring the model in aggregate — you are measuring it in the conditions your safety case depends on. Each partition-specific requirement becomes a node in your safety argument: “The system is safe in condition X because the ML component meets threshold Y in condition X.”
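A partition-level requirement of this kind can be checked mechanically. The sketch below assumes that detection results have already been matched against ground truth and counted per partition; the class and function names are hypothetical, and the thresholds are the ones from the example requirements above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionRequirement:
    partition_id: str
    min_recall: float
    max_fp_per_frame: float


@dataclass(frozen=True)
class PartitionCounts:
    """Matched detection counts for one partition's held-out test set."""
    true_positives: int
    false_negatives: int
    false_positives: int
    frames: int


def evaluate(req: PartitionRequirement, counts: PartitionCounts) -> dict:
    """Compute recall and false positives per frame, then compare to the bounds."""
    ground_truth = counts.true_positives + counts.false_negatives
    if ground_truth == 0 or counts.frames == 0:
        raise ValueError(f"no test data for partition {req.partition_id}")
    recall = counts.true_positives / ground_truth
    fp_per_frame = counts.false_positives / counts.frames
    return {
        "partition": req.partition_id,
        "recall": recall,
        "fp_per_frame": fp_per_frame,
        "passed": recall >= req.min_recall and fp_per_frame <= req.max_fp_per_frame,
    }


REQUIREMENTS = {
    "P-1": PartitionRequirement("P-1", min_recall=0.97, max_fp_per_frame=0.10),
    "P-2": PartitionRequirement("P-2", min_recall=0.89, max_fp_per_frame=0.15),
}
```

An empty partition raises an error instead of passing silently, on the view that a requirement with no test data behind it should surface as a gap rather than as a pass.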

Write requirements at the partition level. Define measurement protocols in the requirement or in a referenced test specification. Specify who owns the test data and how the test set is kept independent from training.
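The calibration requirement is a good illustration of why the measurement protocol needs to be pinned down: ECE depends on the binning scheme. The sketch below uses the common equal-width binning formulation with 10 bins, which is an assumption; the function names are hypothetical, and a real test specification would fix the bin count and the definition of a "correct" detection.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Equal-width-bin ECE: sum over bins of (bin share) * |accuracy - mean confidence|."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = len(confidences)
    if total == 0:
        raise ValueError("empty calibration set")

    conf_sum = [0.0] * n_bins
    hits = [0] * n_bins
    count = [0] * n_bins
    for conf, hit in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        conf_sum[idx] += conf
        hits[idx] += int(hit)
        count[idx] += 1

    ece = 0.0
    for s, h, c in zip(conf_sum, hits, count):
        if c:
            ece += (c / total) * abs(h / c - s / c)
    return ece


def meets_calibration_requirement(confidences, correct, max_ece: float = 0.05) -> bool:
    """Check one partition's calibration set against the ECE bound."""
    return expected_calibration_error(confidences, correct) <= max_ece
```

Running the check once per partition, on that partition's own held-out calibration set, matches the per-partition wording of the requirement.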


Out-of-Distribution Behavior as a First-Class Requirement

Most ML safety failures do not happen when the model is wrong on a hard-but-in-distribution example. They happen when the model receives an input that bears no resemblance to its training data and confidently produces a plausible-looking but incorrect output. This is out-of-distribution (OOD) behavior, and it must be specified.

OOD requirements take two forms:

Detection requirements — the system must detect when inputs fall outside the ODD:

The perception module shall include an ODD monitoring function that produces a flag within 50ms when input data characteristics deviate from ODD parameters, including but not limited to: sensor dropout exceeding 20% of expected point cloud density, ambient light levels below the lower ODD bound, and image blur metrics exceeding the ODD threshold as defined in ICD-PERC-003.

Behavioral requirements on OOD detection — what must happen when OOD is flagged:

When the ODD monitoring function flags an out-of-domain condition, the object detection module shall transition its output state to DEGRADED within one inference cycle. Downstream systems receiving DEGRADED state shall not rely on detection outputs for path planning without explicit override logic, as specified in SYS-SAFE-017.

The DEGRADED state is not a failure. It is a specified, designed response. Writing this as a requirement makes it auditable, testable, and traceable to your safety case. Without it, OOD handling is an implicit assumption — which is the same as no requirement at all.
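The detection and behavioral requirements above can be sketched as a per-cycle check followed by a gate on the output. This is an illustration under stated assumptions: the NOMINAL state name, the field names, and the ambient light and blur bounds are placeholders (the real bounds live in the referenced interface document), and the 50ms timing bound and any recovery logic out of DEGRADED are not modelled.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class OutputState(Enum):
    NOMINAL = "NOMINAL"    # assumed name; only DEGRADED is named in the requirement
    DEGRADED = "DEGRADED"


@dataclass(frozen=True)
class OddBounds:
    max_dropout_fraction: float = 0.20  # the 20% figure from the requirement text
    min_ambient_lux: float = 10.0       # placeholder, not a real ODD bound
    max_blur_metric: float = 0.6        # placeholder, not a real ODD bound


@dataclass(frozen=True)
class FrameStats:
    point_count: int
    expected_point_count: int  # assumed to be greater than zero
    ambient_lux: float
    blur_metric: float


def odd_violations(stats: FrameStats, bounds: OddBounds) -> List[str]:
    """List every ODD parameter this frame violates."""
    violations = []
    dropout = 1.0 - stats.point_count / stats.expected_point_count
    if dropout > bounds.max_dropout_fraction:
        violations.append("sensor_dropout")
    if stats.ambient_lux < bounds.min_ambient_lux:
        violations.append("ambient_light")
    if stats.blur_metric > bounds.max_blur_metric:
        violations.append("image_blur")
    return violations


def next_output_state(stats: FrameStats, bounds: OddBounds) -> OutputState:
    """Any violation moves the output to DEGRADED in the same cycle."""
    return OutputState.DEGRADED if odd_violations(stats, bounds) else OutputState.NOMINAL


def detections_for_planner(state: OutputState, detections: list) -> Optional[list]:
    """Return None in DEGRADED so 'unavailable' is never confused with 'no objects'."""
    return detections if state is OutputState.NOMINAL else None
```

Returning None instead of an empty list is a deliberate choice in this sketch: an empty list would read downstream as a clear scene, which is exactly the silent failure the DEGRADED state exists to prevent.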


Runtime Monitoring Requirements

The model is not the complete AI component. The complete AI component includes the runtime monitoring infrastructure around it. Requirements must cover both.

Monitoring requirements specify what is observed at runtime, what triggers an alert or mode change, and what the latency bounds on that response are:

The runtime monitor shall compute a per-inference uncertainty estimate (using MC Dropout or equivalent method, as defined in IMP-MON-002) and compare it against the safety threshold T_unc defined per ODD partition. If the uncertainty estimate exceeds T_unc for three consecutive inference cycles, the monitor shall set the component health flag to UNCERTAIN and log the triggering inputs.

The runtime monitor shall measure inference latency on every cycle. If latency exceeds the worst-case execution time bound defined in TIMING-PERC-001 by more than 10%, the monitor shall log the event and increment the latency violation counter. If the counter exceeds 5 within a 10-second window, the monitor shall set the health flag to TIMING_FAULT.

All monitoring outputs shall be available to the system health manager via the interface defined in ICD-SHM-001 within one inference cycle of the triggering event.

These requirements make your monitoring infrastructure auditable. They also create testable claims: you can inject out-of-range uncertainty values and verify the monitor responds correctly. That is a test case, which supports your safety argument.
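A small sketch shows how directly these monitoring requirements translate into testable logic. It assumes the uncertainty estimate and latency are computed elsewhere and passed in each cycle. The class and method names are hypothetical, and two behaviors that the requirements leave open are assumptions here: the health flag latches once set, and TIMING_FAULT takes precedence over UNCERTAIN.

```python
from collections import deque
from enum import Enum


class Health(Enum):
    OK = "OK"
    UNCERTAIN = "UNCERTAIN"
    TIMING_FAULT = "TIMING_FAULT"


class RuntimeMonitor:
    def __init__(self, t_unc: float, wcet_s: float, consecutive_limit: int = 3,
                 latency_margin: float = 0.10, max_violations: int = 5,
                 window_s: float = 10.0):
        self.t_unc = t_unc
        self.latency_limit_s = wcet_s * (1.0 + latency_margin)
        self.consecutive_limit = consecutive_limit
        self.max_violations = max_violations
        self.window_s = window_s
        self._consecutive = 0
        self._violation_times = deque()
        self.health = Health.OK
        self.log = []

    def observe(self, now_s: float, uncertainty: float, latency_s: float) -> Health:
        """Process one inference cycle and return the (latched) health flag."""
        # Uncertainty rule: count consecutive cycles above the threshold.
        self._consecutive = self._consecutive + 1 if uncertainty > self.t_unc else 0

        # Latency rule: log each violation, then keep only those inside the window.
        if latency_s > self.latency_limit_s:
            self._violation_times.append(now_s)
            self.log.append(("latency_violation", now_s, latency_s))
        while self._violation_times and now_s - self._violation_times[0] > self.window_s:
            self._violation_times.popleft()

        if len(self._violation_times) > self.max_violations:
            self.health = Health.TIMING_FAULT
        elif self._consecutive >= self.consecutive_limit and self.health is Health.OK:
            self.health = Health.UNCERTAIN
            # A real monitor would also record references to the triggering inputs.
            self.log.append(("uncertain", now_s, uncertainty))
        return self.health
```

Injecting synthetic uncertainty and latency values into a monitor like this is the kind of test case the requirements make possible: three high-uncertainty cycles in a row should yield UNCERTAIN, and a sixth latency violation inside ten seconds should yield TIMING_FAULT.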


Structuring Requirements to Support Safety Case Argumentation

Requirements for AI components must be written with the safety case structure in mind. The safety case is the argument that the system is acceptably safe for its intended use. Each argument node needs evidence. Each piece of evidence needs a requirement it satisfies. If you cannot draw a line from a safety goal to a requirement to a test result, your argument has a gap.

The practical structure looks like this:

Safety Goal → System Safety Requirement → ML Component Requirement → Test Specification → Test Result

For example:

  • Safety Goal (SG-04): The system shall not fail to detect a pedestrian in the path of the vehicle when a safe stopping distance exists.
  • System Safety Requirement (SSR-12): The perception subsystem shall detect pedestrians at distances between 5m and 80m with recall ≥ 97% under ODD partition P-1.
  • ML Component Requirement (PERC-REQ-008): The object detection module shall achieve pedestrian recall ≥ 97% at FPR ≤ 0.1 per frame under P-1 conditions, as measured by test protocol TP-PERC-008.
  • Test Specification (TP-PERC-008): Defines the test dataset, evaluation procedure, and pass/fail threshold.
  • Test Result (TR-PERC-008-v2.3): Measured recall 97.4%, FPR 0.07. Pass.

Each link in this chain is explicit and versioned. When the model is retrained or updated, the chain tells you which requirements are affected, which tests must be re-run, and which safety case nodes need to be re-evaluated. Without this structure, model updates become uncontrolled changes to your safety argument.
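The chain can be represented as a small graph and checked for gaps automatically. The following is a sketch under assumptions: the node kinds, the find_gaps function, and the model version labels ("model-A", "model-B") are hypothetical, and the graph is assumed to be acyclic. The identifiers are the ones from the example above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

CHAIN = ["safety_goal", "system_requirement", "ml_requirement",
         "test_spec", "test_result"]


@dataclass
class Node:
    node_id: str
    kind: str                                   # one of CHAIN
    children: List[str] = field(default_factory=list)
    passed: Optional[bool] = None               # test_result nodes only
    model_version: Optional[str] = None         # test_result nodes only


def find_gaps(nodes: Dict[str, Node], root_id: str, current_model: str) -> List[str]:
    """Walk down from a safety goal and report every break in the chain."""
    gaps: List[str] = []

    def visit(node_id: str) -> None:
        node = nodes[node_id]
        if node.kind == "test_result":
            if node.passed is not True:
                gaps.append(f"{node_id}: result is not a pass")
            if node.model_version != current_model:
                gaps.append(f"{node_id}: stale, produced for {node.model_version}")
            return
        if not node.children:
            expected = CHAIN[CHAIN.index(node.kind) + 1]
            gaps.append(f"{node_id}: no linked {expected}")
            return
        for child_id in node.children:
            visit(child_id)

    visit(root_id)
    return gaps


EXAMPLE = {
    "SG-04": Node("SG-04", "safety_goal", ["SSR-12"]),
    "SSR-12": Node("SSR-12", "system_requirement", ["PERC-REQ-008"]),
    "PERC-REQ-008": Node("PERC-REQ-008", "ml_requirement", ["TP-PERC-008"]),
    "TP-PERC-008": Node("TP-PERC-008", "test_spec", ["TR-PERC-008-v2.3"]),
    "TR-PERC-008-v2.3": Node("TR-PERC-008-v2.3", "test_result",
                             passed=True, model_version="model-A"),
}
```

Calling find_gaps(EXAMPLE, "SG-04", "model-A") returns an empty list. Calling it with "model-B" reports the test result as stale, which is how a retrain shows up as a specific set of tests to re-run rather than as a vague worry.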


How Flow Engineering Supports AI Requirements for Safety Cases

Managing this structure (ODD-partitioned requirements, OOD behavior specifications, monitoring requirements, test traceability, safety case links) at scale across a real product is a substantial challenge. Document-based tools struggle with it because the relationships between requirements, tests, and safety case arguments are not well represented in a flat document hierarchy. You end up maintaining requirements traceability matrix (RTM) spreadsheets by hand, which means the links are always slightly out of date.

Flow Engineering is built around a graph-based requirements model, which maps directly onto the structure described above. Each requirement, test specification, safety goal, and architectural element is a node in the graph. The relationships between them — “this ML requirement satisfies this system safety requirement,” “this test verifies this requirement,” “this safety argument depends on this evidence” — are typed edges. When a requirement changes, the graph tells you what downstream nodes are affected. When a model is updated and a test result changes, the trace to the safety case argument is immediate.

For AI components specifically, Flow Engineering supports the authoring of performance-based requirements with embedded metadata: ODD partition references, measurement protocols, threshold values, and calibration set identifiers can all be structured fields on a requirement node, not prose buried in a document. That structured representation makes automated consistency checking possible — you can query whether every safety goal in the graph has a complete traceability chain down to a passing test result, and surface gaps before a review.

Flow Engineering is purpose-built for hardware and systems engineering teams, not adapted from a software project management tool. Teams working to standards like ISO 26262 or IEC 61508, or to DO-178C alongside emerging ML-specific guidance (such as EASA’s AI roadmap or the FAA’s developing ML guidance), find that the tool’s traceability model aligns with the evidence structure those standards require.

It is worth being clear about scope: Flow Engineering does not manage test execution infrastructure, model training pipelines, or dataset versioning. Those are separate concerns. What it manages is the requirements and traceability layer — the argument that your ML component, as tested, satisfies the safety goals your system depends on. That is the layer most teams find hardest to keep coherent as systems evolve.


Practical Starting Points

If your team is starting this process now, this is the sequence that tends to work:

  1. Define the ODD first. Before writing any ML requirements, document the operational conditions the model is expected to handle. Partition the ODD into discrete, testable subsets. Requirements written without ODD partitions are not safety requirements.

  2. Write performance requirements at the partition level. For each ODD partition and each safety-relevant object class or scenario, define a minimum acceptable recall, precision, or error bound, along with a measurement protocol.

  3. Write OOD detection requirements. Specify what constitutes an out-of-domain input for your system, how the model or monitor detects it, and what the behavioral response is.

  4. Write monitoring requirements. Specify what is observed at runtime, what thresholds trigger what responses, and what the latency bounds are.

  5. Build the traceability structure before you need it. The time to establish the link from safety goal to ML requirement to test specification is before the first safety review, not during it.

  6. Version your requirements with your model. Every time the model is retrained, the requirements and their associated test results must be re-evaluated. Treat a model update as a change-controlled event, not a continuous deployment.

The neural network in your perception stack is not unspecifiable. It requires a different kind of specification — one that bounds behavior rather than prescribing implementation, that covers degraded conditions as carefully as normal ones, and that supports a structured argument rather than a single benchmark number. That kind of specification is harder to write. It is also the kind that survives a safety audit.