How to Write Testable Requirements for AI and Machine Learning Components in a Hardware System
The engineers who first tried to apply traditional requirements practices to machine learning components discovered the problem immediately: a requirement like “the object detection module shall correctly identify pedestrians” is completely untestable as written. Correctly according to what dataset? Under what lighting conditions? At what false positive rate?
AI and ML components do not behave deterministically. They generalize from training distributions, and their failure modes are statistical, contextual, and often emergent. Writing requirements that are both precise enough to verify and flexible enough to capture real operational conditions is one of the harder problems in modern systems engineering. This guide covers how to do it.
Why Standard Requirements Practice Breaks Down for AI Components
Traditional requirements for deterministic software follow a straightforward logic: define the input, define the output, define the transformation. The function either produces the correct output or it does not. Test coverage is a matter of enumerating states.
Neural networks and learned models do not have a single correct output for most inputs. They produce probability distributions, confidence scores, or ranked predictions. Their behavior degrades gracefully (or catastrophically) as input data drifts from the training distribution. A model that achieves 98% accuracy on a held-out test set may perform at 71% accuracy on production data six months after deployment if the input distribution has shifted.
This creates three distinct problems for requirements writers:
The threshold problem. What constitutes acceptable performance? 95% recall sounds good until you ask: at what precision? Measured on which population? Over what time window?
The distribution problem. A requirement verified on a test dataset is only meaningful if that dataset is representative of operational conditions. Specifying what “representative” means is itself a requirements activity.
The drift problem. AI components can degrade silently over time as real-world conditions evolve. Requirements must address not just initial performance but ongoing performance monitoring.
Specifying AI Performance Requirements: The Core Metrics
Accuracy, Precision, and Recall Are Not Interchangeable
For any classification or detection task, requirements must specify the metric that maps to the operational risk profile — not just the metric that makes the model look best.
- Accuracy (correct predictions / total predictions) is appropriate when false positives and false negatives carry roughly equal cost. It is almost never the right primary metric for safety-critical applications.
- Precision (true positives / predicted positives) is the right primary metric when false alarms carry high cost — for example, an automated braking system that triggers incorrectly.
- Recall (true positives / actual positives) is the right primary metric when missed detections carry high cost — for example, a tumor detection system in medical imaging.
- F1 score and its variants (F-beta) allow weighting when both error types matter but at different rates.
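To make the difference between these metrics concrete, here is a minimal Python sketch that computes each one from raw confusion counts. The `ConfusionCounts` type and the example numbers are hypothetical, chosen only to show how a high accuracy figure can coexist with poor recall on an imbalanced dataset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from evaluating a binary detector on a labeled dataset."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def accuracy(c: ConfusionCounts) -> float:
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


def precision(c: ConfusionCounts) -> float:
    predicted_pos = c.tp + c.fp
    return c.tp / predicted_pos if predicted_pos else 0.0


def recall(c: ConfusionCounts) -> float:
    actual_pos = c.tp + c.fn
    return c.tp / actual_pos if actual_pos else 0.0


def f_beta(c: ConfusionCounts, beta: float = 1.0) -> float:
    """beta > 1 weights recall more heavily; beta < 1 weights precision."""
    p, r = precision(c), recall(c)
    denom = beta**2 * p + r
    return (1 + beta**2) * p * r / denom if denom else 0.0


# Hypothetical evaluation: 100 pedestrians present across 10,000 frames.
counts = ConfusionCounts(tp=60, fp=10, fn=40, tn=9890)
print(f"accuracy={accuracy(counts):.3f}  precision={precision(counts):.3f}")
print(f"recall={recall(counts):.3f}  F2={f_beta(counts, beta=2.0):.3f}")
```

In this made-up case the detector misses 40 of 100 pedestrians, yet accuracy stays above 0.99 because true negatives dominate the count. That is the reason accuracy is rarely a suitable primary metric when positives are rare.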
A well-formed AI performance requirement specifies the metric, the threshold, the measurement population, and the confidence interval:
“The pedestrian detection module shall achieve a minimum recall of 0.97 and a minimum precision of 0.90 when evaluated on the operational validation dataset (OVD-2026-PED) with n ≥ 10,000 images, where each minimum applies to the lower bound of the 95% confidence interval estimated by bootstrap resampling of the test set.”
That requirement is testable. The original phrasing was not.
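One way to turn a requirement of this shape into a pass/fail check is sketched below. It assumes the threshold is compared against a one-sided percentile-bootstrap lower bound. That is one reasonable reading of a 95% confidence level, not the only one. The function names and the sample data are hypothetical, and the sample is smaller than the n ≥ 10,000 in the example requirement only to keep the sketch fast.

```python
import random


def bootstrap_lower_bound(outcomes, n_resamples=500, confidence=0.95, seed=0):
    """One-sided percentile-bootstrap lower bound on the mean of 0/1 outcomes.

    For recall, pass one entry per actual positive (1 = detected).
    For precision, pass one entry per predicted positive (1 = correct).
    """
    rng = random.Random(seed)  # fixed seed so the verification is repeatable
    n = len(outcomes)
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    # Position of the (1 - confidence) quantile among the resampled means.
    idx = int((1.0 - confidence) * n_resamples)
    return means[idx]


def meets_threshold(outcomes, threshold, **kwargs):
    """True if the bootstrap lower bound is at or above the required minimum."""
    return bootstrap_lower_bound(outcomes, **kwargs) >= threshold


# Hypothetical result: 1,960 of 2,000 labeled pedestrians were detected.
detected = [1] * 1960 + [0] * 40
recall_lower = bootstrap_lower_bound(detected)
print(f"95% lower bound on recall: {recall_lower:.4f}")
print("meets 0.97 recall requirement:", meets_threshold(detected, 0.97))
```

Comparing the lower bound rather than the point estimate means a model that scores just above the threshold on a small test set does not pass by chance. Whether a one-sided or two-sided interval applies is something the requirement itself should state.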
Distribution Shift Robustness Requirements
Distribution shift — the divergence between training data statistics and operational data statistics — is one of the most common failure modes in deployed AI systems. Requirements must address it explicitly.
There are three categories of shift to specify against:
Covariate shift: Input feature distributions change (e.g., lighting conditions change seasonally; sensor calibration drifts over the hardware lifecycle).
Label shift: The underlying class frequencies change (e.g., a medical diagnosis model deployed in a new patient population with different disease prevalence).
Concept drift: The relationship between inputs and correct outputs changes (e.g., a fraud detection model faces new attack patterns not represented in training).
Requirements for robustness should specify:
- Maximum tolerated performance degradation under defined shift scenarios
- The shift scenarios themselves (defined quantitatively, not qualitatively)
- Monitoring thresholds that trigger re-evaluation or model update
“Under simulated covariate shift corresponding to a 15% change in mean luminance and 10% change in contrast variance from the training distribution, the detection module shall maintain recall no lower than 0.93.”
This is verifiable. “The system shall be robust to environmental variation” is not.
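A quantitatively defined shift scenario can be generated mechanically. The sketch below is an illustration under stated assumptions: it treats an image as a flat list of intensities, applies the shift per image, and omits clipping to the sensor's valid range. The function names and the pixel values are hypothetical.

```python
from statistics import fmean, pvariance


def shift_luminance_contrast(pixels, mean_change=0.15, variance_change=0.10):
    """Rescale intensities so the mean moves by `mean_change` and the variance
    by `variance_change`, both as fractions of the original values.

    Clipping is deliberately omitted; a real pipeline would need it, and
    clipping alters the statistics actually achieved.
    """
    mu = fmean(pixels)
    scale = (1.0 + variance_change) ** 0.5  # variance scales with scale**2
    target_mu = mu * (1.0 + mean_change)
    return [target_mu + scale * (p - mu) for p in pixels]


def passes_shift_requirement(recall_under_shift, floor=0.93):
    """Compare recall measured on the shifted dataset with the required floor."""
    return recall_under_shift >= floor


# Hypothetical normalized intensities for one tiny image.
pixels = [0.20, 0.40, 0.50, 0.70]
shifted = shift_luminance_contrast(pixels)
print(f"mean ratio: {fmean(shifted) / fmean(pixels):.3f}")
print(f"variance ratio: {pvariance(shifted) / pvariance(pixels):.3f}")
```

The shifted dataset is then evaluated exactly like the baseline test set, and the measured recall is compared with the floor stated in the robustness requirement.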
Operational Requirements vs. Training Data Requirements
The distinction between operational requirements and training data requirements is routinely collapsed in practice, and collapsing it consistently creates compliance problems. The two are different artifacts serving different purposes.
Operational Requirements
Operational requirements describe what the AI component must do in deployment. They are part of the system requirements baseline and are subject to standard change control, traceability, and verification. They specify:
- Performance thresholds (as described above)
- Input space bounds (the operational design domain, or ODD in autonomous systems parlance)
- Latency and throughput requirements
- Safe failure behavior (what the system does when confidence falls below threshold)
- Monitoring and reporting obligations
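Two of these items, input space bounds and safe failure behavior, are often enforced together at runtime. The following sketch shows one possible gating scheme. The `OddBounds` fields, the 0.8 confidence floor, and the fallback action are all hypothetical placeholders rather than recommended values.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    USE_PREDICTION = "use_prediction"
    FALLBACK = "fallback"  # e.g. hand over to a conventional subsystem


@dataclass(frozen=True)
class OddBounds:
    """Hypothetical ODD limits; a real ODD has many more dimensions."""
    min_lux: float
    max_lux: float
    max_speed_kph: float

    def contains(self, lux: float, speed_kph: float) -> bool:
        in_light_range = self.min_lux <= lux <= self.max_lux
        in_speed_range = 0.0 <= speed_kph <= self.max_speed_kph
        return in_light_range and in_speed_range


def decide(confidence, lux, speed_kph, odd, min_confidence=0.8):
    """Use the model output only inside the ODD and above the confidence floor."""
    if not odd.contains(lux, speed_kph):
        return Action.FALLBACK
    if confidence < min_confidence:
        return Action.FALLBACK
    return Action.USE_PREDICTION


odd = OddBounds(min_lux=50.0, max_lux=100_000.0, max_speed_kph=60.0)
print(decide(confidence=0.95, lux=500.0, speed_kph=30.0, odd=odd))
print(decide(confidence=0.95, lux=5.0, speed_kph=30.0, odd=odd))
```

Writing the gate this way makes both conditions separately testable: one set of test cases exercises the ODD boundary, another exercises the confidence floor.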
Training Data Requirements
Training data requirements describe the data that must be used to develop the model, and the properties that data must have. They are upstream of the model and are typically part of a Data Management Plan or equivalent document. They specify:
- Minimum dataset size and class balance requirements
- Data collection conditions (sensors, environments, demographics)
- Labeling methodology and inter-annotator agreement thresholds
- Data validation procedures (how to confirm that the dataset meets the specification)
- Exclusion criteria (what data must not be included)
The relationship between these two document types must be explicit and traceable. If an operational requirement says the system shall achieve 0.97 recall on pediatric patients, the training data requirement must specify that the training dataset includes a minimum percentage of pediatric cases at defined age bands.
Regulators — particularly FDA — are increasingly requiring this traceability. A model that meets performance requirements on a biased test set is not a model that meets the operational requirement; it is a model that has been evaluated incorrectly.
V&V for AI Components: What Changes
Traditional software verification is largely a process of demonstrating that the implementation matches the specification. For deterministic code, structural coverage metrics (MC/DC, branch coverage) provide a framework for claiming that enough of the logic has been exercised.
None of this transfers cleanly to learned models.
What Does Transfer
Requirement-level testing still applies. You can write test procedures against AI performance requirements exactly as you would against any other requirement, provided the requirements are testable as described above. The difference is that test results are statistical, not binary.
Independence of test data is essential. Test datasets must be demonstrably independent from training data. This sounds obvious; in practice it requires explicit data provenance tracking and is one of the gaps regulators have cited most frequently.
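A basic independence check can be automated by hashing sample content. The sketch below is a minimal illustration with hypothetical sample identifiers and byte strings. It only catches exact duplicates, so near-duplicates such as adjacent video frames or re-encoded images would need a separate method.

```python
import hashlib


def content_digest(sample: bytes) -> str:
    """SHA-256 of the raw sample bytes, used as a content fingerprint."""
    return hashlib.sha256(sample).hexdigest()


def find_leaked_samples(train_samples, test_samples):
    """Return ids of test samples whose bytes also appear in the training set.

    Both arguments map a sample id to that sample's raw bytes.
    """
    train_digests = {content_digest(data) for data in train_samples.values()}
    return sorted(
        sample_id
        for sample_id, data in test_samples.items()
        if content_digest(data) in train_digests
    )


# Hypothetical stand-ins for image files.
train = {"train-001": b"frame-a", "train-002": b"frame-b"}
test = {"test-001": b"frame-c", "test-002": b"frame-b"}
print(find_leaked_samples(train, test))
```

Recording the digests alongside each dataset version gives the provenance trail something concrete to point at when independence has to be demonstrated.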
Failure mode analysis must be conducted. For safety-critical systems, AI Failure Mode and Effects Analysis (AI-FMEA) should enumerate the ways the model can fail, the conditions that make each failure more likely, and the system-level mitigations. This feeds directly into V&V planning.
What Is Different
Coverage metrics for neural networks are not mature. Neuron coverage and similar metrics have been proposed as analogs to structural coverage but do not yet have the regulatory acceptance that MC/DC has in DO-178C. Most current guidance allows alternative approaches with adequate justification.
Behavioral testing replaces specification-based testing as the primary method. Since you cannot verify a neural network against its “code,” you verify it against its behavior across a comprehensive set of test scenarios. This requires large, curated, labeled test datasets — which are themselves engineering artifacts subject to configuration management.
Shadow deployment and staged rollout are verification activities. For systems where sufficient test data cannot be collected in a lab environment, shadow deployment (running the model on live data without acting on its outputs) is a recognized method for accumulating performance evidence. This should be planned as part of the V&V strategy, not bolted on post-development.
Ongoing monitoring is part of the V&V argument. Particularly for systems with long operational lives, the initial verification provides a point-in-time claim. Maintaining that claim requires monitoring that the operational distribution has not drifted beyond the bounds specified in the robustness requirements.
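One way to make "has not drifted beyond the bounds" measurable is a drift statistic computed on a monitored input feature. The sketch below uses the Population Stability Index (PSI), which is a technique chosen for illustration and not one the guidance above prescribes. The bin edges and data are hypothetical, and any alert threshold would have to come from the robustness requirements.

```python
import math


def population_stability_index(reference, live, edges, eps=1e-6):
    """PSI between two samples of one numeric feature over shared bin edges.

    `edges` must be sorted ascending; they split the axis into
    len(edges) + 1 bins.
    """
    def bin_fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges that v has reached or passed.
            counts[sum(v >= e for e in edges)] += 1
        total = len(values)
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / total, eps) for c in counts]

    ref = bin_fractions(reference)
    cur = bin_fractions(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical feature values: verification-time sample vs. field sample.
reference = list(range(100))
live = [x + 10 for x in range(100)]
edges = [25, 50, 75]
psi = population_stability_index(reference, live, edges)
print(f"PSI = {psi:.4f}")
```

A monitoring requirement can then name the feature, the reference sample, the bin edges, and the PSI value above which re-evaluation is triggered.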
Regulatory Guidance: What EASA, FAA, and FDA Have Published
EASA (AI Roadmap and Concept Paper on AI)
EASA’s AI Roadmap, now in its second iteration, and the associated Concept Paper on AI for Aviation address AI components in certified systems. Key requirements from an engineering perspective:
- Learning Assurance: EASA frames AI development as a “learning process” with defined phases (concept, design, verification, production, monitoring). Requirements for each phase must be documented.
- Operational Design Domain: The ODD must be explicitly specified, bounded, and verified. The AI component’s behavior outside the ODD must be defined.
- Explainability requirements: For human-in-the-loop systems, the AI must provide outputs in a form that allows human operators to exercise meaningful oversight. This may be a system-level requirement that flows down to the AI component.
FAA (AC 20-180 and AI/ML Issue Papers)
The FAA’s approach has been to raise issue papers on a project-by-project basis while developing broader guidance. The consistent themes across published issue papers:
- Means of Compliance for AI/ML components in DO-178C certified systems require problem formulation documents, data management plans, and explicit treatment of known limitations.
- Performance requirements must include bounds on acceptable degradation under identified distributional shifts.
- The distinction between the “trained model” artifact and the “inference engine” software artifact must be maintained through the certification lifecycle.
FDA (Predetermined Change Control Plans and AI/ML Action Plan)
FDA has taken the most progressive regulatory stance, recognizing that AI models in medical devices will change after initial clearance.
- Predetermined Change Control Plans (PCCPs): Manufacturers must specify in advance what changes to the AI model are permissible without requiring a new submission, the performance boundaries that trigger a required re-submission, and the methods that will be used to evaluate changes.
- Total Product Lifecycle (TPLC) approach: Performance requirements are not static. FDA expects manufacturers to monitor, detect, and respond to performance degradation in deployment.
- Transparency requirements: Labeling for AI-enabled devices must describe the model’s intended use, training data characteristics, known limitations, and performance in relevant subpopulations.
The practical implication of FDA guidance is that requirements must be written with change management in mind from the start. A requirement like “the model shall maintain AUC ≥ 0.91 across all patient subgroups identified in Table 3” creates a clear trigger for PCCP evaluation if monitoring shows AUC dropping in any subgroup.
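A subgroup requirement of this kind maps directly onto a monitoring check. The sketch below computes AUC per subgroup using the pairwise-ranking definition and reports any subgroup under the floor. The subgroup names and scores are hypothetical, and the quadratic-time AUC is adequate only for small samples.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random
    negative, with ties counted as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def subgroups_below_floor(records, floor=0.91):
    """records: iterable of (subgroup, label, score) tuples.

    Returns {subgroup: auc} for every subgroup whose AUC is below the floor.
    """
    by_group = {}
    for group, label, score in records:
        group_labels, group_scores = by_group.setdefault(group, ([], []))
        group_labels.append(label)
        group_scores.append(score)
    aucs = {g: roc_auc(ys, ss) for g, (ys, ss) in by_group.items()}
    return {g: auc for g, auc in aucs.items() if auc < floor}


# Hypothetical monitoring data for two subgroups.
records = [
    ("adult", 1, 0.9), ("adult", 1, 0.8), ("adult", 0, 0.2), ("adult", 0, 0.1),
    ("pediatric", 1, 0.9), ("pediatric", 1, 0.4),
    ("pediatric", 0, 0.6), ("pediatric", 0, 0.1),
]
print(subgroups_below_floor(records))
```

A non-empty result is the trigger: it identifies which subgroup breached the requirement and by how much, which is the information a PCCP evaluation would start from.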
Managing AI Requirements Within the System Baseline
AI requirements do not exist in isolation. They are part of a larger system requirements baseline that includes hardware requirements, interface requirements, safety requirements, and regulatory requirements. The challenge is ensuring that AI-specific requirements are traceable, visible, and protected from inadvertent scope changes as the system evolves.
Several practices make this tractable:
Tag AI requirements as a distinct category. Whether you are using attributes, custom fields, or module structure in your requirements tool, AI requirements need to be filterable as a group. When regulatory questions arrive — and they will — being able to pull all AI performance requirements, their verification status, and their links to training data specifications in a single view is not optional.
Maintain bidirectional traceability to the data management plan. AI operational requirements trace down to test procedures, but they also trace laterally to training data requirements and upstream to safety requirements. This is a graph problem, not a list problem. Tools that manage requirements as a flat document structure make this traceability extremely difficult to maintain.
Version the model alongside the requirements. The trained model artifact — the weights file, the inference graph, the ONNX export — must be under configuration management in the same system as the requirements it is claimed to satisfy. A requirements change that does not trigger a review of the model version (or vice versa) creates a silent compliance gap.
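Where the requirements tool and the CM system are not integrated, a content digest recorded against each requirement can serve as a minimal link. The sketch below assumes a hypothetical baseline mapping from requirement ID to a model name and expected SHA-256 digest. The identifiers and byte strings are illustrative only.

```python
import hashlib


def artifact_digest(model_bytes: bytes) -> str:
    """SHA-256 of the serialized model artifact (weights file, ONNX export)."""
    return hashlib.sha256(model_bytes).hexdigest()


def stale_links(baseline, artifacts):
    """Return requirement ids whose recorded model digest no longer matches.

    baseline:  {req_id: (model_name, expected_digest)}
    artifacts: {model_name: current artifact bytes}
    A missing artifact is also reported as stale.
    """
    stale = []
    for req_id, (model_name, expected) in baseline.items():
        data = artifacts.get(model_name)
        if data is None or artifact_digest(data) != expected:
            stale.append(req_id)
    return sorted(stale)


# Hypothetical state: the detector was retrained after AI-REQ-012 was verified.
verified_weights = b"weights-v1"
baseline = {"AI-REQ-012": ("ped-detector", artifact_digest(verified_weights))}
current = {"ped-detector": b"weights-v2"}
print(stale_links(baseline, current))
```

Run as part of a release gate, a check like this turns a silent mismatch between model version and verified requirement into a visible review item.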
Platforms like Flow Engineering are built around a graph-based model where requirements, design artifacts, and verification evidence are nodes with typed relationships rather than documents with linked references. For AI requirements specifically, this means an operational performance requirement can maintain live links to its parent safety requirement, its sibling training data requirement, the specific model version it applies to, the test dataset used for verification, and the monitoring threshold that triggers re-evaluation — all within the same connected baseline. When a regulatory reviewer asks for the complete evidence package for the pedestrian detection requirement, the answer is a traversal, not a document hunt.
This matters more for AI requirements than for most other requirement types because the evidence structure is more complex. The verification argument for a deterministic function is linear: requirement → test procedure → test result. The verification argument for an AI component is a network: operational requirement → test dataset → statistical test result → training data requirement → data validation report → monitoring plan → deployment performance data. Managing that network in a spreadsheet or a document creates compliance risk that compounds with system complexity.
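The difference between a linear and a networked evidence structure is easier to see in code. The sketch below models the evidence network as typed edges and collects everything reachable from one operational requirement. All identifiers and relationship names are hypothetical, and the sketch does not represent any particular tool's data model.

```python
from collections import defaultdict, deque

# Hypothetical baseline: each edge is (source, relationship, target).
EDGES = [
    ("SAF-REQ-3", "refined_by", "OP-REQ-12"),
    ("OP-REQ-12", "verified_on", "TEST-DATASET-A"),
    ("TEST-DATASET-A", "yields", "TEST-RESULT-A1"),
    ("OP-REQ-12", "supported_by", "TRAIN-DATA-REQ-4"),
    ("TRAIN-DATA-REQ-4", "validated_by", "DATA-VALIDATION-REPORT-4"),
    ("OP-REQ-12", "monitored_by", "MONITORING-PLAN-2"),
    ("MONITORING-PLAN-2", "collects", "DEPLOYMENT-DATA-2"),
]


def evidence_package(root, edges):
    """Breadth-first walk over outgoing edges from `root`.

    Returns the (relationship, node) pairs reached, in visit order.
    """
    outgoing = defaultdict(list)
    for source, relation, target in edges:
        outgoing[source].append((relation, target))

    seen = {root}
    queue = deque([root])
    package = []
    while queue:
        node = queue.popleft()
        for relation, target in outgoing[node]:
            if target not in seen:
                seen.add(target)
                package.append((relation, target))
                queue.append(target)
    return package


for relation, node in evidence_package("OP-REQ-12", EDGES):
    print(f"{relation:>13} -> {node}")
```

The walk follows outgoing edges only, so the parent safety requirement is not part of the result. Answering "which safety requirement does this evidence support" would be a second traversal in the opposite direction.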
Practical Starting Points
If your team is currently managing AI requirements the same way as all other requirements, here is a prioritized sequence for improving your practice:
- Audit your existing AI requirements for testability. Any requirement that does not specify a metric, a threshold, and a measurement population is not testable. Rewrite before you write test procedures.
- Separate your training data requirements into a distinct document type. Give them their own identifier schema, their own review cycle, and explicit traceability to the operational requirements they support.
- Define your operational design domain explicitly. Before you can specify robustness requirements, you need a bounded specification of the conditions under which the system is expected to operate. This is a requirements artifact, not a design artifact.
- Build distribution shift scenarios into your V&V plan. Identify the three to five most plausible shift scenarios for your application, define them quantitatively, and specify the performance thresholds that apply under each.
- Establish configuration management linkage between model versions and the requirements they satisfy. If your CM system and your requirements tool cannot communicate, build a manual bridge. Silent gaps between model versions and requirements versions are audit failures waiting to happen.
- Map your requirements to the applicable regulatory guidance. For aviation programs, identify which requirements are addressed by EASA/FAA AI-specific guidance. For medical devices, identify which requirements are within scope of the PCCP. Regulatory mapping belongs in the requirements baseline, not in a presentation deck.
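The testability audit in the first step lends itself to a simple automated pass, provided requirements are stored with structured attributes. The sketch below assumes a hypothetical record type with `metric`, `threshold`, and `population` fields. Real requirements tools expose such attributes differently.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AiRequirement:
    """Hypothetical structured view of one AI performance requirement."""
    req_id: str
    text: str
    metric: Optional[str] = None
    threshold: Optional[float] = None
    population: Optional[str] = None

    def missing_fields(self):
        required = ("metric", "threshold", "population")
        return [name for name in required if getattr(self, name) is None]


def audit(requirements):
    """Map each untestable requirement id to the attributes it lacks."""
    return {
        r.req_id: r.missing_fields() for r in requirements if r.missing_fields()
    }


reqs = [
    AiRequirement("AI-001", "shall correctly identify pedestrians"),
    AiRequirement(
        "AI-002",
        "shall achieve minimum recall of 0.97 on OVD-2026-PED",
        metric="recall",
        threshold=0.97,
        population="OVD-2026-PED",
    ),
]
print(audit(reqs))
```

The output is a rewrite list: every requirement that appears in it needs a metric, a threshold, or a measurement population before a test procedure can be written against it.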
AI components are not going to become simpler or less common in hardware systems. The engineering discipline for specifying and verifying them is young but developing fast. The teams that build rigorous AI requirements practices now — with proper metrics, proper separation of training and operational concerns, and proper tooling — will be better positioned as regulatory requirements continue to tighten.