How Do You Write Requirements for an AI System That Learns?
Traditional requirements engineering rests on a foundational assumption: the system behaves the same way every time given the same inputs. Write a tight enough SHALL statement, verify it with a repeatable test, close the requirement. For deterministic systems—a motor controller, a digital filter, a communication protocol—this works. For AI components, it doesn’t.
An AI model trained on one dataset, evaluated on a held-out split, and deployed into a real-world product may produce different outputs for the same input on different inference runs (if stochastic sampling is involved), will degrade silently as the input distribution drifts, and can fail catastrophically on inputs that lie just outside its training distribution without any explicit error signal. None of that behavior is captured by “The system SHALL classify the obstacle with accuracy ≥ 95%.”
This article answers the practical question directly: how do you write requirements for an AI component that learns, adapts, or operates probabilistically? What new artifact types does that introduce? And how do you connect those artifacts back to the classical system requirements that your customers, regulators, and verification teams still expect?
Why Classical Requirement Forms Break Down
The SHALL statement works because it expresses a binary contract: either the system meets the condition or it doesn’t. Verification closes the loop. This structure maps cleanly to MIL-STD-498, DO-178C, ISO 26262, and every other standard your program office is likely to cite.
AI components introduce at least three structural violations of that contract:
Probabilistic outputs. A classifier doesn’t return “obstacle detected.” It returns a probability distribution over classes. The decision threshold that converts that distribution to an actionable output is an engineering parameter, not a physical law—and moving it shifts the false positive/false negative tradeoff in ways that matter to system safety.
Distribution-dependent performance. Accuracy measured on a validation dataset is not accuracy measured on deployed inputs. The gap between the two is a function of how well the training and validation distributions represent the deployment environment—a property that is rarely captured in a traditional requirement or verified by a traditional test.
Post-deployment change. If the model is retrained, fine-tuned, or updated in production, performance characteristics change. A requirement written against model version 1.3 may not hold for model version 1.7. Classical requirements don’t carry version identity for the artifact they constrain.
These aren’t edge cases. They are the normal operating conditions of any production AI system. Requirements practice has to adapt to them explicitly.
Probabilistic Performance Specifications
The replacement for a point-threshold SHALL statement is a probabilistic performance specification. The structure looks like this:
Specify the metric explicitly. Don’t write “accuracy.” Write precision, recall, F1, mean average precision, calibration error, or whatever metric actually corresponds to what the system needs to do. “Accuracy” is undefined unless the class distribution and error cost asymmetry are also specified.
Specify the threshold as a bound over a defined population. Instead of “The system SHALL detect pedestrians with accuracy ≥ 95%,” write: “The pedestrian detection component SHALL achieve recall ≥ 0.92 and false positive rate ≤ 0.04, measured over the ODD-compliant test dataset at confidence threshold τ = 0.65.”
Specify acceptable error distributions, not just rates. For safety-relevant AI components, the distribution of errors matters as much as their rate. A model with 2% error concentrated in low-visibility night conditions may be more dangerous than one with 4% error spread uniformly. The requirement should specify which error patterns are intolerable, not just what the aggregate rate must be.
Include confidence intervals and sample size requirements. A claim that a model achieves 94% recall is meaningless without a confidence interval and a sample size. The requirement should specify how performance must be demonstrated: “…measured with 95% confidence over a minimum test population of N ≥ 10,000 samples drawn from the evaluation dataset.”
This is more words than a SHALL statement. It is also substantially more honest about what you’re actually specifying.
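To make the structure concrete, here is a minimal sketch of how a probabilistic recall requirement of that shape could be checked against evaluation counts, using a Wilson score interval for the confidence bound. All identifiers, thresholds, and counts are illustrative assumptions, not values from any specific program:

```python
import math

def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    """Lower bound of the Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to a two-sided 95% confidence level.
    """
    if n == 0:
        return 0.0
    p = successes / n
    denom = 1 + z**2 / n
    center = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - margin) / denom

# Hypothetical requirement: recall >= 0.92 at 95% confidence, N >= 10,000.
MIN_SAMPLES = 10_000
RECALL_FLOOR = 0.92

true_positives, positives = 9_350, 10_000   # hypothetical evaluation counts
assert positives >= MIN_SAMPLES, "sample size requirement not met"

# The claim that is actually demonstrated is the interval's lower bound,
# not the point estimate of 0.935.
meets_requirement = wilson_lower_bound(true_positives, positives) >= RECALL_FLOOR
```

The point of the sketch is that the verification condition is the confidence bound, not the raw ratio: a point estimate of 0.935 on 100 samples would not demonstrate the same requirement.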
The Operational Design Domain as a Formal Artifact
In autonomous systems work—particularly automotive—the Operational Design Domain (ODD) defines the conditions under which a system is designed to operate correctly. Weather conditions, lighting, road type, speed range, object classes, sensor configurations. For AI components, the ODD is the environmental specification. It determines what “in-distribution” means.
The ODD is not documentation. It is a first-class engineering artifact that must be:
- Formally defined with enumerated conditions and ranges, not prose descriptions
- Versioned alongside the model it constrains
- Traced bidirectionally to system-level safety requirements and to the dataset requirements that govern how training data is collected
When a system engineer writes “The system SHALL operate in rain,” the AI requirements engineer’s job is to decompose that into ODD entries: precipitation rate range (mm/hr), visibility reduction factor, sensor degradation model for each modality, and the corresponding dataset coverage requirement for each operating sub-condition.
If the ODD is not formally defined and traced, you cannot determine whether your validation dataset covers the conditions your system requirement specifies. That gap is where AI system failures typically live.
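One way to read "formally defined with enumerated conditions and ranges" is as typed, versioned data rather than prose. The sketch below assumes hypothetical condition names, units, and ranges purely for illustration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high

@dataclass(frozen=True)
class OddEntry:
    """One enumerated operating condition with an explicit unit and range."""
    condition: str
    unit: str
    range: Range

# Illustrative ODD version identifier, tracked alongside the model version.
ODD_VERSION = "1.3.0"

ODD = [
    OddEntry("precipitation_rate", "mm/hr", Range(0.0, 8.0)),
    OddEntry("ambient_illuminance", "lux", Range(10.0, 100_000.0)),
    OddEntry("ego_speed", "km/h", Range(0.0, 130.0)),
]

def in_odd(sample: dict) -> bool:
    """True only if every sampled condition falls inside its ODD range."""
    return all(e.range.contains(sample[e.condition]) for e in ODD)
```

An input with `precipitation_rate` of 12.0 mm/hr is out-of-ODD by construction — the same determination that, in prose form, would depend on someone's reading of "operates in rain."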
Dataset Requirements: A New Artifact Type
Classical requirements engineering doesn’t have dataset requirements. AI engineering cannot function without them.
A dataset requirement specifies the properties that training, validation, and test data must have in order for the model trained on that data to meet its performance specifications under its ODD. The artifact includes:
Coverage requirements. What conditions, classes, edge cases, and failure modes must be represented in the dataset, and in what proportions? This traces directly to the ODD.
Data quality requirements. Labeling accuracy thresholds, annotation consistency requirements, sensor calibration standards for collected data, exclusion criteria for corrupted or out-of-domain samples.
Provenance and licensing requirements. Where did the data come from? Are there regulatory or contractual constraints on its use? This is not a legal afterthought—for safety-critical and defense applications, data provenance is an auditability requirement.
Balance and representation requirements. For systems that operate across diverse populations or environments, systematic underrepresentation in the training dataset translates directly to performance disparity in deployment. The requirement needs to specify minimum representation thresholds for critical subpopulations.
Dataset requirements are upstream of model requirements. If the dataset requirement isn’t met, the model performance requirement cannot be verified with any validity. That upstream dependency must be formally captured in your requirements structure—not assumed.
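A coverage requirement of the kind described above can be made machine-checkable. The sketch below assumes hypothetical sub-condition labels and minimum counts; the shape, not the numbers, is the point:

```python
from collections import Counter

# Hypothetical coverage requirement: minimum sample counts per ODD sub-condition.
COVERAGE_REQUIREMENT = {
    "day_clear": 5_000,
    "day_rain": 2_000,
    "night_clear": 2_000,
    "night_rain": 1_000,   # often the sub-condition most at risk of underrepresentation
}

def coverage_gaps(sample_conditions: list[str]) -> dict[str, int]:
    """Return each sub-condition's shortfall against the coverage requirement."""
    counts = Counter(sample_conditions)
    return {
        cond: required - counts.get(cond, 0)
        for cond, required in COVERAGE_REQUIREMENT.items()
        if counts.get(cond, 0) < required
    }

# A dataset short on night/rain samples fails the upstream dataset requirement,
# so the downstream model performance requirement cannot be validly verified.
dataset = (["day_clear"] * 6_000 + ["day_rain"] * 2_500
           + ["night_clear"] * 2_100 + ["night_rain"] * 400)
gaps = coverage_gaps(dataset)   # {"night_rain": 600}
```

A non-empty gap report blocks verification of the model performance requirement, which is exactly the upstream dependency the text describes.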
Model Cards and System Bills of Materials
Two artifacts from the AI/ML community have direct engineering significance and belong in a requirements management context:
Model cards (Mitchell et al., 2019, Google) document a model’s intended use, training data characteristics, performance across subgroups, and known limitations. For a systems engineer, the model card is closest to an Interface Control Document (ICD) for an AI component—it specifies what the component does, under what conditions, and where it fails. Model cards should be treated as versioned, controlled artifacts that trace to performance requirements and ODD definitions.
AI/ML Software Bills of Materials (AI SBOMs) extend the software SBOM concept to document model lineage, training framework versions, dataset versions, and dependency chains. For systems in which the AI component is a supplied item (as it often is), the AI SBOM is the artifact that lets the integrating systems engineer understand what they’re actually integrating.
Neither artifact was designed for requirements management workflows. Both need to be explicitly connected to them—which means your tooling needs to support that connection structurally, not just as an attached file.
Connecting AI Requirements to Classical System Requirements
The hardest integration problem is not writing the AI-specific requirements. It’s connecting them to the system-level requirements your stakeholders, customers, and certifiers actually read.
The chain looks like this:
System-level safety requirement
→ AI subsystem performance requirement (probabilistic)
→ ODD definition artifact
→ Dataset coverage requirement
→ Model performance specification
→ Test dataset requirement
→ Verification result
Each arrow is a trace link. In a document-based requirements system, these links exist in someone’s mental model and occasionally in a manually maintained Requirements Traceability Matrix (RTM). In a model-based requirements system, they exist as explicit, queryable relationships between nodes.
The distinction matters for AI systems more than for classical systems because the number of intermediate artifacts is larger, the artifacts are more heterogeneous, and the traceability has to survive model updates. When model version 1.7 is released, every requirement and test that was scoped to version 1.6 needs to be re-evaluated. If your traceability is in an RTM spreadsheet, that re-evaluation is a manual audit. If it’s in a graph, it’s a query.
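The "it's a query" claim can be made concrete with a toy trace graph and a breadth-first traversal. The artifact IDs and edges below are hypothetical, and this stdlib sketch stands in for whatever graph store a real tool would use:

```python
from collections import deque

# Hypothetical trace graph: each artifact maps to the artifacts derived from it.
TRACE = {
    "SYS-SAFETY-012": ["AI-PERF-003"],
    "AI-PERF-003": ["ODD-1.3", "DATA-COV-007"],
    "ODD-1.3": ["DATA-COV-007"],
    "DATA-COV-007": ["MODEL-SPEC-1.6"],
    "MODEL-SPEC-1.6": ["TEST-DS-009"],
    "TEST-DS-009": ["VER-RESULT-114"],
}

def downstream(node: str) -> set[str]:
    """All artifacts reachable from `node` — the impact set of changing it."""
    seen, queue = set(), deque([node])
    while queue:
        for nxt in TRACE.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Releasing model 1.7 invalidates everything scoped to the 1.6 specification:
impacted = downstream("MODEL-SPEC-1.6")   # {"TEST-DS-009", "VER-RESULT-114"}
```

In a spreadsheet RTM, producing `impacted` is a manual audit; in a graph, it is this one traversal.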
How Modern Tooling Handles This
Most established requirements management tools—IBM DOORS Next, Jama Connect, Polarion—were designed around module-based document structures. AI-specific artifacts (model cards, dataset requirements, ODD definitions) can be attached or appended, but they exist outside the native object model. Trace links to them are manual and fragile.
Flow Engineering (flowengineering.com) takes a different approach: AI component requirements are model nodes in the same connected graph as system and subsystem requirements. An ODD definition isn’t an attachment to a requirement—it’s a node with typed relationships to the performance requirements it constrains and the dataset requirements that implement it. When a model version changes, the impact is visible as a graph traversal, not a manual search through linked documents.
This matters specifically for the re-verification problem: when your ML team updates the model, Flow Engineering lets you identify immediately which performance requirements, test cases, and trace paths are affected by that versioned change—before the updated model ships. That’s operationally different from managing the same information in a document-based system, where the update creates orphaned traces that only surface during a review.
For teams working under DO-178C supplement DO-330, ISO 21448 (SOTIF), or MIL-STD-882E with AI components, the ability to demonstrate complete, current traceability from safety requirements through dataset requirements to verification results is not optional. The tooling needs to support it natively.
Practical Starting Points
If your team is integrating an AI component into a safety-relevant or regulated system and your current requirements practice is classical document-based, here’s where to start:
- Identify every AI component in your system architecture and flag it explicitly. Don't treat it as a black-box subsystem with conventional performance specs.
- Define the ODD before writing performance requirements. Performance requirements written before the ODD is defined are unverifiable. The ODD determines what "valid input" means.
- Write dataset requirements in parallel with performance requirements. They are co-dependent. A performance requirement without a dataset requirement has no valid verification method.
- Replace point thresholds with bounded distributions. Specify metric, threshold, confidence interval, sample size, and evaluation population.
- Define acceptable error distributions, not just error rates. Identify which failure modes are intolerable and express them as requirements, not test observations.
- Establish version identity for every AI artifact. Model version, dataset version, and ODD version are all configuration management items. Requirements and test results must carry the artifact version they were validated against.
- Build trace links structurally, not manually. If you're managing AI component requirements in a classical document tool, you're accumulating technical debt in your traceability that will become an audit liability.
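The version-identity point lends itself to a small sketch: a verification result that carries the artifact versions it was demonstrated against, so staleness is detectable mechanically. All IDs and version strings here are hypothetical:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VerificationResult:
    """A verification result bound to the artifact versions it was run against."""
    requirement_id: str
    model_version: str
    dataset_version: str
    odd_version: str
    passed: bool

result = VerificationResult(
    requirement_id="AI-PERF-003",
    model_version="1.6.0",
    dataset_version="2024-11-02",
    odd_version="1.3.0",
    passed=True,
)

def still_valid(result: VerificationResult, deployed_model: str) -> bool:
    """A passing result is void once the model version it constrained moves on;
    a fuller check would compare dataset and ODD versions the same way."""
    return result.passed and result.model_version == deployed_model
```

With this shape, deploying model 1.7.0 makes `still_valid(result, "1.7.0")` false — the re-verification trigger falls out of configuration data instead of tribal knowledge.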
The Honest Summary
Writing requirements for AI components is harder than writing them for deterministic systems because the underlying behavior is harder to characterize. The temptation is to write classical SHALL statements and qualify them loosely enough that the AI component can always be claimed to satisfy them. That approach fails at verification, fails at safety review, and fails in deployment.
The alternative—probabilistic specifications, formal ODD definitions, dataset requirements, versioned model artifacts, connected traceability—is more work upfront. It is also the only approach that gives you an honest answer to the question your program office and your certifier will eventually ask: can you demonstrate that your AI component behaves as specified, across the conditions it will encounter, and that you’ll know when it stops doing so?
That question has an answer. Getting there requires treating the new artifact types as engineering artifacts, not documentation, and building the infrastructure to trace them properly from the start.