How Do You Write Requirements for a System Driven by Machine Learning?

Most requirements engineers encounter this problem the same way: a system architect walks in and says the detection, classification, or decision logic is going to be a trained model, not a coded algorithm. The immediate question is whether your existing requirement format — the well-formed, verifiable, atomic shall-statement — can handle that. The honest answer is: partially, but not fully, and the gap matters.

This is not a theoretical concern. Medical devices with AI/ML components are now under FDA oversight. Autonomous aircraft systems are under EASA scrutiny. DoD programs routinely field AI-enabled targeting, logistics, and ISR systems. In each domain, the same structural inadequacy shows up: a requirement format designed to specify deterministic behavior is being stretched to cover systems whose behavior is fundamentally probabilistic and training-dependent.

Here is what the state of practice actually looks like, what the alternatives are, and how to build a requirements structure that works.


Why Shall-Statements Break Down for ML Systems

A well-written shall-statement is precise, verifiable, and unambiguous. “The system shall detect obstacle presence within 50 ms at a range of 0–30 m under visibility conditions above 100 m.” That is testable. You run the scenario, measure the latency, pass or fail.

The problem is not that shall-statements are wrong for ML systems. It is that they are incomplete in a specific structural way. A trained model does not have a specification the way an algorithm does. Its behavior is encoded in weights derived from training data, shaped by a loss function, and evaluated against a held-out dataset. The requirement you write constrains the output of that process, but it cannot, by itself, constrain the process or the conditions under which the output is valid.

Three failure modes result from writing only output-level shall-statements for ML systems:

Unconstrained distribution shift. A model trained on daytime highway imagery may perform to specification during development and fail silently when deployed in tunnel lighting. The shall-statement that passed verification records nothing about the conditions under which it was verified.

Unmeasured proxy metrics. “The classifier shall achieve 95% accuracy” sounds like a requirement. But accuracy is a summary statistic that hides class imbalance, rare-event failure modes, and context-dependency. A model can satisfy this requirement and still be unsafe.

No specification of degraded-mode behavior. When a model encounters out-of-distribution input, what should the system do? If the requirement only specifies nominal operation, there is no contracted behavior for the degraded case — and in safety-critical systems, that is where most accidents occur.


The Four Requirement Types for ML Systems

Effective ML requirements are not a single layer. They are (at minimum) four distinct specification types that interact with each other and must be explicitly linked.

1. Training Data Quality Requirements

These specify what the training dataset must contain, how it must be labeled, and what distributional properties it must have. This is requirements engineering applied to data — a relatively new discipline.

Examples:

  • “The training dataset shall contain no fewer than 10,000 annotated instances per target class.”
  • “Annotation agreement shall meet or exceed an inter-rater reliability score (Cohen’s κ) of 0.85 across all safety-critical labels.”
  • “The dataset shall include samples from all five defined operational lighting conditions in proportions representative of the deployment environment.”

These requirements are verified through data audits, not system tests. They are upstream of the model, but they causally determine what the model can and cannot do.
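Such audits are straightforward to automate against each dataset release. The sketch below is a minimal illustration in Python, assuming a hypothetical record format with two independent annotators per sample; the class-count and κ thresholds mirror the example requirements above.

```python
from collections import Counter

MIN_INSTANCES_PER_CLASS = 10_000   # from the example data requirement
MIN_COHENS_KAPPA = 0.85            # inter-rater reliability floor

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same samples."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum((freq_a[c] / n) * (freq_b[c] / n)
                   for c in set(labels_a) | set(labels_b))
    return 1.0 if expected == 1 else (observed - expected) / (1 - expected)

def audit_dataset(records):
    """records: iterable of dicts such as
    {"class": "pedestrian", "annotator_a": "pedestrian", "annotator_b": "cyclist"}."""
    failures = []
    counts = Counter(r["class"] for r in records)
    for cls, n in counts.items():
        if n < MIN_INSTANCES_PER_CLASS:
            failures.append(f"class '{cls}': {n} instances < {MIN_INSTANCES_PER_CLASS}")
    kappa = cohens_kappa([r["annotator_a"] for r in records],
                         [r["annotator_b"] for r in records])
    if kappa < MIN_COHENS_KAPPA:
        failures.append(f"Cohen's kappa {kappa:.2f} < {MIN_COHENS_KAPPA}")
    return failures  # an empty list means the audit passed
```

An audit of this kind becomes verification evidence for the data requirements themselves, independent of any model test.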

2. Model Performance Bound Requirements

These are the parameterized output-level requirements, and they go beyond scalar accuracy metrics.

The modern form is a performance requirement specified as a function of operating condition and consequence class. Examples:

  • “Under Operational Design Domain Class A (clear daylight, dry surface, vehicle speed ≤ 50 km/h), the obstacle detection recall shall be ≥ 0.99 at a precision of ≥ 0.95.”
  • “Under ODD Class B (low visibility, precipitation), the system shall meet a minimum recall of 0.93 with no increase in the false-negative rate for Class 1 (pedestrian) targets.”
  • “The false positive rate for safety-critical alerts shall not exceed 1 per 10,000 inference cycles under any defined ODD class.”

The key move here is parameterization by operating condition. Performance is not a single number; it is a surface over the defined operational space.
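In practice, that surface is usually captured as a threshold table keyed by operating condition, and verification runs stratified metrics against it rather than a single aggregate. A minimal sketch follows; the ODD class names and threshold values are illustrative assumptions, not recommendations.

```python
# Performance floors keyed by (ODD class, target class). Values are illustrative.
THRESHOLDS = {
    ("ODD_A", "pedestrian"): {"recall": 0.99, "precision": 0.95},
    ("ODD_B", "pedestrian"): {"recall": 0.93, "precision": 0.90},
}

def check_stratum(odd_class, target, tp, fp, fn):
    """Evaluate one (ODD class, target class) stratum against its floors."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    floors = THRESHOLDS[(odd_class, target)]
    return {
        "recall": recall,
        "precision": precision,
        "meets_spec": recall >= floors["recall"] and precision >= floors["precision"],
    }

# Example: 990 true positives, 30 false positives, 10 false negatives in ODD A.
print(check_stratum("ODD_A", "pedestrian", tp=990, fp=30, fn=10))
```

The verification report is then a table of strata, each with its own pass/fail verdict, rather than one number for the whole system.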

3. Operational Design Domain (ODD) Constraints

ODD constraints define the boundary conditions under which the ML component’s performance specifications apply. Inside the ODD, the model is required to perform to spec. Outside it, the model is not — and that transition must be handled explicitly.

ODD constraints are effectively preconditions for model performance requirements. They specify the environmental, operational, and sensor-condition envelope: illumination ranges, weather states, sensor health thresholds, geographic constraints, traffic density bounds.

Without ODD constraints, a performance requirement is underspecified. With them, it becomes verifiable: you can run test scenarios that sample the ODD space and confirm coverage.
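One way to make the ODD itself verifiable is to encode the envelope as structured data with a membership test that the test harness and the runtime ODD monitor can share. A minimal sketch, with hypothetical parameter names and bounds:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OddClassA:
    """Illustrative envelope for one ODD class; names and bounds are assumptions."""
    illuminance_lux: tuple = (2_000.0, 100_000.0)   # clear daylight
    speed_kmh: tuple = (0.0, 50.0)
    surfaces: frozenset = frozenset({"dry"})

    def contains(self, sample: dict) -> bool:
        """True when a test scenario or live input lies inside the envelope."""
        lo_lux, hi_lux = self.illuminance_lux
        lo_v, hi_v = self.speed_kmh
        return (lo_lux <= sample["illuminance_lux"] <= hi_lux
                and lo_v <= sample["speed_kmh"] <= hi_v
                and sample["surface"] in self.surfaces)

def odd_coverage(odd, scenarios):
    """Fraction of the test scenarios that fall inside the ODD envelope."""
    return sum(odd.contains(s) for s in scenarios) / len(scenarios) if scenarios else 0.0
```

The same envelope definition then anchors both coverage analysis of the test campaign and the out-of-ODD detection that triggers fallback behavior.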

4. Fallback and Safety Behavior Requirements

This is the specification of what the system must do when:

  • Input is detected as out-of-ODD
  • Model confidence falls below a threshold
  • A required sensor input is degraded or absent
  • The model’s output contradicts a higher-level safety monitor

These requirements are the most safety-critical of the four types, and they are the most often missing in initial ML system specifications.

The form is typically a behavioral contract: “If [condition], then the system shall [action] within [time], and shall [notify/log/alert] as follows.” The “condition” usually references ODD monitoring outputs or confidence-score thresholds. The “action” is a deterministic fallback — hand to human operator, enter safe state, reject output and hold last-known-good value.

Fallback behavior requirements must be verifiable independently of the model. They are the boundary where ML system design reconnects with classical safety engineering.
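Because the fallback logic is deterministic, it can be written and tested as ordinary control code with the model replaced by a stub. The sketch below is one illustrative dispatch over the conditions listed above; the condition names and confidence threshold are assumptions.

```python
from enum import Enum, auto

class FallbackAction(Enum):
    CONTINUE = auto()          # nominal: use the model output
    HOLD_LAST_GOOD = auto()    # reject output, hold last-known-good value
    HANDOVER = auto()          # hand control to the human operator
    SAFE_STATE = auto()        # enter the defined safe state

CONFIDENCE_FLOOR = 0.85  # illustrative threshold

def select_fallback(in_odd: bool, confidence: float,
                    sensors_healthy: bool, monitor_veto: bool) -> FallbackAction:
    """Each branch corresponds to one 'if [condition], then [action]' requirement."""
    if monitor_veto:
        return FallbackAction.SAFE_STATE      # output contradicts the safety monitor
    if not sensors_healthy:
        return FallbackAction.HANDOVER        # required sensor degraded or absent
    if not in_odd:
        return FallbackAction.HANDOVER        # input detected as out-of-ODD
    if confidence < CONFIDENCE_FLOOR:
        return FallbackAction.HOLD_LAST_GOOD  # confidence below threshold
    return FallbackAction.CONTINUE
```

Every branch can be exercised in a unit test without running inference, which is exactly the independence this requirement type demands.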


Emerging Specification Approaches

Three specification patterns are gaining traction across safety-critical AI domains.

Parameterized Performance Requirements express performance as an explicit function of context. Rather than “the system shall achieve F1 ≥ 0.95,” you write a table or formal expression: F1(ODD_class, target_type, sensor_health) ≥ threshold(ODD_class, target_type). This forces the team to enumerate the ODD space and assign performance expectations to each region. It is more work upfront but substantially more testable.

Property-Based Specifications borrow from formal methods. Instead of specifying exact output values, they specify properties the output must satisfy: monotonicity (higher confidence score must correspond to higher ground-truth probability), robustness to small perturbations (a 5% change in input luminance shall not change classification output for Category A objects), and fairness properties (detection rate shall not differ by more than 2% between demographic sub-groups in the training distribution). Property-based specs are harder to write but expose failure modes that output-level specs miss entirely.
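These properties translate directly into checks run over a test set or a generated input space. The sketch below covers only the robustness property as an example; `classify` and `scale_luminance` are hypothetical hooks into the model under test and the image pipeline, not a real API.

```python
LUMINANCE_DELTA = 0.05  # the ±5% perturbation from the robustness property

def check_luminance_robustness(classify, scale_luminance, images):
    """Property: a 5% change in input luminance shall not change the classification."""
    violations = []
    for idx, img in enumerate(images):
        baseline = classify(img)
        for factor in (1.0 - LUMINANCE_DELTA, 1.0 + LUMINANCE_DELTA):
            if classify(scale_luminance(img, factor)) != baseline:
                violations.append((idx, factor))
    return violations  # an empty list means the property held on this sample
```

The monotonicity and fairness properties follow the same pattern: each becomes a function over model outputs whose violations are counted and reported.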

Behavioral Contracts define the system’s obligations as pre/post-condition pairs, similar to design-by-contract in software engineering. “Given: input image is within ODD Class A, confidence score ≥ 0.85. Then: classification output shall match ground truth label within acceptable error class.” Contracts can be layered — the ML model has a contract with the system, and the system has a contract with the operator. When a contract’s precondition fails, a fallback contract activates.
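A contract of this kind can be represented directly as a pre/post-condition pair with an explicit link to its fallback, which makes the layering checkable. A minimal sketch with illustrative predicates and field names:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Contract:
    """If `pre` holds on the inputs, `post` must hold on the outputs;
    otherwise the linked fallback contract (if any) takes over."""
    name: str
    pre: Callable[[dict], bool]
    post: Callable[[dict, dict], bool]
    fallback: Optional["Contract"] = None

    def check(self, inputs: dict, outputs: dict):
        if self.pre(inputs):
            return self.name, self.post(inputs, outputs)
        if self.fallback is not None:
            return self.fallback.check(inputs, outputs)
        return self.name, False  # no contract covers this input: a specification gap

# Layering: the nominal contract covers in-ODD, high-confidence inputs;
# the degraded contract only requires a safe action.
degraded = Contract("degraded",
                    pre=lambda i: True,
                    post=lambda i, o: o.get("action") in {"handover", "safe_state"})
nominal = Contract("nominal",
                   pre=lambda i: i["in_odd"] and i["confidence"] >= 0.85,
                   post=lambda i, o: o["label"] in i["acceptable_labels"],
                   fallback=degraded)
```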

None of these approaches eliminates shall-statements. They extend them. A complete ML requirements specification uses shall-statements for interface, timing, resource, and safety constraints — and uses parameterized, property-based, or contract-based forms for the ML behavioral core.


What Regulators Are Actually Requiring

FDA: Software as a Medical Device (SaMD) and AI/ML Action Plan

The FDA’s AI/ML-Based SaMD Action Plan (2021, updated 2023) and the subsequent Good Machine Learning Practice (GMLP) guidance converge on a framework recognizable as the four-type structure above. The FDA requires sponsors to define the intended use and indications for use (which map to ODD constraints), specify algorithm performance stratified by clinical context and patient subgroup (parameterized performance requirements), and document change control protocols that specify when a model update requires re-verification (training data quality governance).

The FDA does not mandate a specific requirements format, but the Technical Performance Assessment framework in their guidance is effectively a property-based specification approach applied to clinical performance claims.

EASA: Concept Paper on AI Roadmap and AMC for Machine Learning

EASA’s AI Roadmap 2.0 and the Acceptable Means of Compliance for machine learning in aviation (first published in 2021) introduce the Learning Assurance framework. Central to this framework is the Operational Design Domain, a term adopted in alignment with automotive terminology, together with an explicit requirement to specify performance at ODD boundaries.

EASA’s guidance specifically requires that requirements distinguish between in-distribution and out-of-distribution operation, that fallback behavior be pre-specified and independently verified, and that data management requirements form a traceable part of the system’s qualification evidence. This is exactly the four-type structure, arrived at through aviation safety engineering rather than ML theory.

DoD: Responsible AI and T&E Guidance

DoD’s Responsible AI (RAI) guidelines and the more recent AI Acquisition and Sustainment guidance require test and evaluation approaches that address distributional coverage, not just point performance. The AI Test and Evaluation framework published by the DOT&E office requires T&E plans to specify the operational conditions under which AI systems will be evaluated — which forces programs to write ODD constraints in order to design test events.

The DoD has also moved toward behavioral assurance cases as an alternative to traditional specification-then-verify for AI-enabled systems, which maps closely to the behavioral contract approach described above.

All three regulatory bodies are converging on the same answer from different directions: traditional flat requirements are insufficient, the ODD is the organizing concept, and fallback behavior must be specified independently.


How Modern Tooling Handles Hybrid Requirements Structures

The practical problem with multi-type ML requirements is that they don’t fit neatly in a document. Training data quality requirements have dependencies on model performance requirements, which have preconditions defined by ODD constraints, which trigger fallback behavior requirements when violated. That is a graph, not a list.

Document-based requirements tools, which include most legacy RM platforms, handle this by creating parallel documents and manually maintained trace matrices. This works until requirements change, and for ML systems they change often, as models are retrained and operating domains expand. Manual trace maintenance breaks quickly under iterative model development.

This is where graph-based, AI-native tools provide a meaningful structural advantage. Flow Engineering is built around a graph model of requirements and system architecture, which means the four requirement types for an ML system can be represented as distinct node types with explicit typed relationships between them. A training data quality requirement links to the model performance requirement it enables. A performance requirement has a precondition link to its ODD constraint. An ODD constraint has an activation link to its fallback behavior specification. When any node changes, the impact propagates through the graph and surfaces in the tool, not in a manually updated Excel trace matrix.
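To make that structure concrete without reference to any particular product, the sketch below models the typed links as a small directed graph and computes the downstream impact of a change. The node names and relation labels are illustrative, and this is not a depiction of Flow Engineering's internals.

```python
from collections import defaultdict

# (source, relation, target) triples mirroring the links described above.
EDGES = [
    ("DATA-001: ≥10k instances per class", "enables",     "PERF-001: recall ≥ 0.99 in ODD A"),
    ("ODD-A: clear-daylight envelope",     "precondition", "PERF-001: recall ≥ 0.99 in ODD A"),
    ("ODD-A: clear-daylight envelope",     "activates",    "FBK-001: handover on ODD exit"),
]

def impacted_by(changed_node, edges):
    """All requirements reachable downstream of a changed node."""
    downstream = defaultdict(list)
    for src, _, dst in edges:
        downstream[src].append(dst)
    seen, stack = set(), [changed_node]
    while stack:
        for nxt in downstream[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Changing the ODD definition flags both the performance requirement it
# preconditions and the fallback requirement it activates.
print(impacted_by("ODD-A: clear-daylight envelope", EDGES))
```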

For teams working on AI-enabled hardware — autonomous systems, medical devices, avionics, defense platforms — Flow Engineering’s hybrid requirement structure handles the co-existence of classical hardware shall-statements and ML behavioral specifications in the same model. That matters because most AI-enabled systems are not pure ML systems; they are hardware platforms where ML handles specific perception or decision functions alongside deterministic subsystems. The traceability problem spans both.

Flow Engineering won’t write your ODD constraints for you, and it won’t tell you what performance thresholds are appropriate for your application — those are engineering judgments. What it does is make the structure of your requirements model explicit and maintainable as the ML components evolve.


Where to Start

If you are currently writing requirements for an ML-enabled system using only output-level shall-statements, here is a practical starting sequence:

First, identify the ML boundary. Where does the system hand off to the trained model, and where does it hand back? Write interface requirements for both transitions before writing any behavioral requirements.
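As a sketch of what those interface requirements pin down, the handoff in each direction can be captured as a typed structure before any behavioral requirement is written; the field names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PerceptionInput:
    """What the system hands to the trained model (illustrative fields)."""
    frame_id: int
    timestamp_us: int
    image: bytes           # sensor frame, ODD-checked upstream
    sensor_health: str     # e.g. "nominal" or "degraded"

@dataclass(frozen=True)
class PerceptionOutput:
    """What the system accepts back from the model (illustrative fields)."""
    frame_id: int          # must echo the input frame for traceability
    label: str
    confidence: float      # consumed by the fallback logic downstream
    latency_ms: float      # checked against the timing shall-statement
```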

Second, define the ODD. Enumerate the conditions under which the ML component is expected to operate. This is not a wish list — it is a contractual boundary. Be specific and testable.

Third, write parameterized performance requirements against the ODD. For each significant region of the ODD, specify the performance metrics, their thresholds, and the acceptable confidence intervals. Use stratified metrics, not aggregate ones.

Fourth, write training data quality requirements. Trace each performance claim back to a data requirement that makes it achievable. If you cannot identify the data condition that enables a performance requirement, you don’t understand the requirement well enough to verify it.

Fifth, write fallback behavior requirements last. Once the ODD is defined, the fallback specification becomes clear: it is the system’s contracted behavior for every condition not covered by the ODD.

The sequence matters. Starting with fallback behavior before defining the ODD produces vague safety requirements. Starting with performance before defining ODD produces unverifiable claims. The ODD is the foundation.

Writing requirements for ML systems is harder than writing requirements for deterministic systems. The difficulty is real, not a tooling problem or a process problem. But it is tractable, and the frameworks to do it well — from regulatory bodies, from the formal methods community, from industrial practice — now exist. The question is whether your requirements model is structured to use them.