How Do You Write Good Requirements for an AI-Enabled Perception System?
This is one of the genuinely hard open problems in systems engineering for autonomous systems. Not hard in the way that writing clear requirements is always hard — ambiguous language, conflicting stakeholders, scope creep. Hard in a deeper sense: the traditional machinery of requirements engineering was built on an assumption that does not hold for machine learning systems. That assumption is that you can, in principle, fully specify the behavior you want before you build the system.
For a perception system powered by a neural network, you cannot. The answer to the question in the title is not a clean framework you can download and apply. It is a set of complementary approaches, each covering a different dimension of the problem, with known gaps between them. This article lays those out honestly.
Why Traditional “Shall” Statements Break Down
A conventional requirement looks like this: “The system shall detect a pedestrian at a distance of up to 80 meters under clear daytime conditions with a false negative rate not exceeding 0.1%.”
That sentence looks well-formed. It is quantified, testable, and traceable. The problem is that it is only testable against the specific distribution of pedestrians in your test dataset. Change the lighting slightly, change the camera model, change the ethnic or clothing distribution of the pedestrian population, and the performance figure shifts. The requirement passes or fails depending on what you put in front of the sensor — not on any stable property of the system itself.
This is not an edge case failure of requirements engineering. It is structural. ML-based perception systems are functions of their training data and deployment environment simultaneously. A requirement that ignores training data and environment is not actually specifying what you think it is specifying.
There are three specific failure modes when teams try to apply classical “shall” requirements to perception:
The pass/fail illusion. A fixed performance threshold on a fixed test set creates the appearance of verified compliance while leaving the system’s behavior outside that test set entirely uncharacterized.
The missing context. Functional requirements say nothing about what happens when inputs fall outside the training distribution. For a perception system, out-of-distribution inputs are not rare exceptions — they arrive continuously. Every novel lighting condition, every new vehicle geometry, every unexpected road marking is potentially out-of-distribution.
Untraceable safety arguments. When you escalate to a system-level safety argument — “the vehicle will not strike a pedestrian” — you cannot derive that claim from a collection of probabilistic performance thresholds without a formal argument structure that classical requirements documents do not provide.
None of this means requirements are useless. It means you need a different set of requirement types.
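The pass/fail illusion above can be made concrete with arithmetic. In this sketch, every count is invented for illustration: an aggregate false negative rate clears a “shall not exceed 0.1%” threshold while one ODD segment misses pedestrians at forty times that rate.

```python
# Illustrative numbers only: an aggregate FNR that satisfies a 0.1% threshold
# while one ODD segment fails badly. Segment names and counts are hypothetical.
segments = {
    "clear_day":  {"pedestrians": 9_000, "missed": 2},
    "dusk":       {"pedestrians": 900,   "missed": 2},
    "night_rain": {"pedestrians": 100,   "missed": 4},
}

total = sum(s["pedestrians"] for s in segments.values())
missed = sum(s["missed"] for s in segments.values())
aggregate_fnr = missed / total  # 8 / 10000 = 0.0008 -> "passes" the 0.1% threshold

per_segment_fnr = {
    name: s["missed"] / s["pedestrians"] for name, s in segments.items()
}  # night_rain: 4 / 100 = 0.04 -> 40x the threshold, invisible in the aggregate
```

Because the rare segment contributes only 1% of the test mass, its failure is arithmetically drowned out — which is exactly why the stratified requirements in the next section exist.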
Performance-Based Requirements
The most direct replacement for binary functional requirements is a performance envelope: a specification that defines acceptable behavior as a region rather than a point.
A performance-based requirement might read: “The pedestrian detection subsystem shall achieve a true positive rate ≥ 0.97 and a false positive rate ≤ 0.02 when evaluated against the canonical evaluation dataset, stratified across all ODD segments defined in [ODD-SPEC-001], with no single ODD segment falling below TPR = 0.93.”
Several things are happening here that are not happening in a classical “shall” statement:
- The requirement is explicitly tied to a defined dataset. That dataset becomes a first-class artifact, not a background assumption.
- Performance is stratified across ODD segments, so aggregate metrics cannot hide localized failure.
- Minimum per-segment floors prevent the system from trading catastrophic failure in one condition for excellent performance in another.
Performance-based requirements demand more verification infrastructure, but they are honest about what they specify. They also create natural traceability hooks to the dataset requirements below them.
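A verification harness for such an envelope falls out of the requirement text almost directly. In this sketch, the thresholds come from the example requirement above; the segment names and metric values are hypothetical:

```python
# Sketch of verifying a stratified performance envelope, assuming per-segment
# TPR/FPR have already been computed against the canonical evaluation dataset.
GLOBAL_TPR_MIN = 0.97
GLOBAL_FPR_MAX = 0.02
SEGMENT_TPR_FLOOR = 0.93

segment_metrics = {  # hypothetical: ODD segment -> (TPR, FPR, sample count)
    "urban_day":    (0.985, 0.012, 42_000),
    "urban_dusk":   (0.961, 0.018, 9_500),
    "highway_rain": (0.942, 0.015, 4_100),
}

def envelope_verdict(metrics):
    """Return (pass/fail, aggregate TPR, aggregate FPR, failing segments)."""
    n = sum(c for _, _, c in metrics.values())
    tpr = sum(t * c for t, _, c in metrics.values()) / n   # sample-weighted
    fpr = sum(f * c for _, f, c in metrics.values()) / n
    floor_failures = [seg for seg, (t, _, _) in metrics.items()
                      if t < SEGMENT_TPR_FLOOR]
    ok = tpr >= GLOBAL_TPR_MIN and fpr <= GLOBAL_FPR_MAX and not floor_failures
    return ok, tpr, fpr, floor_failures
```

The per-segment floor check is the part a classical harness omits: without it, the weighted aggregate could pass while a low-volume segment quietly degrades.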
Dataset Requirements
If a perception system’s behavior is partly a function of its training data, then training data is a system artifact and must be specified like one.
Dataset requirements are a distinct requirement type covering:
Coverage requirements. What conditions, object classes, weather states, lighting conditions, geographic regions, and sensor configurations must be represented in the training corpus? At what minimum frequency?
Balance requirements. Are rare-but-safety-critical scenarios represented at frequencies sufficient for the model to learn from them, even if those frequencies are artificially upsampled?
Annotation requirements. What labeling protocol governs ground truth? What inter-annotator agreement threshold is required? How are ambiguous or partially occluded objects handled?
Provenance and version control. What version of the training dataset produced the deployed model? This is a requirements question, not just an MLOps question — if the dataset changes, the compliance status of performance requirements must be re-evaluated.
Dataset requirements are uncomfortable for traditional systems engineers because they look like software quality requirements applied to data pipelines. That discomfort is a signal that the systems engineering process needs to expand its scope, not that dataset requirements are out of place.
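Coverage requirements of this kind are mechanically checkable against a dataset manifest. The condition keys, minimum counts, and record fields below are hypothetical; a real program would derive them from the ODD specification:

```python
# Sketch: checking a training-dataset manifest against coverage requirements.
# Field names ('weather', 'lighting') and thresholds are illustrative only.
from collections import Counter

coverage_reqs = {  # (weather, lighting) -> minimum frame count in the corpus
    ("rain", "night"):  5_000,
    ("snow", "day"):    2_000,
    ("clear", "night"): 10_000,
}

def coverage_gaps(manifest, reqs):
    """manifest: iterable of per-frame records; returns unmet requirements
    as {condition: (actual count, required count)}."""
    counts = Counter((r["weather"], r["lighting"]) for r in manifest)
    return {cond: (counts.get(cond, 0), needed)
            for cond, needed in reqs.items()
            if counts.get(cond, 0) < needed}
```

Run against every candidate dataset version, this turns “the training data covers the ODD” from an assertion into a reproducible compliance check tied to a specific manifest version.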
Operational Design Domain Bounding
Operational Design Domain (ODD) specification is the practice of explicitly defining the conditions under which the system is intended to operate and, critically, the conditions under which it is not. For autonomous systems, the ODD is not a design decision made once — it is a requirements artifact that must be maintained, versioned, and traced.
ODD parameters typically include:
- Geographic constraints (mapped areas, road types, maximum speed limits)
- Environmental conditions (precipitation rate, visibility distance, temperature range)
- Traffic conditions (maximum traffic density, presence of construction zones)
- Time constraints (daytime operation only, or defined illumination thresholds)
- Infrastructure dependencies (lane markings present and visible, functioning traffic signals)
The ODD creates the boundary condition for all other perception requirements. A performance requirement is only meaningful inside the ODD it is scoped to. This means your requirements architecture needs to express ODD membership as a prerequisite condition for all performance claims, not as an afterthought.
ODD bounding also creates the requirement for an ODD monitoring function: the system must be able to detect when it is approaching or exceeding ODD boundaries and respond appropriately — typically by requesting driver intervention or executing a minimal risk maneuver. This is itself a perception requirement, and one that is frequently absent from early-stage requirements documents.
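The monitoring function implied by that requirement can be sketched as a boundary check over the ODD parameters listed above. The parameter names, limits, and the shape of the state estimate are all assumptions made for illustration:

```python
# Sketch of an ODD monitoring check, assuming the platform exposes current
# environment estimates as a dict. Parameter names and limits are hypothetical.
from dataclasses import dataclass

@dataclass
class OddBounds:
    max_precip_mm_h: float = 4.0
    min_visibility_m: float = 150.0
    max_speed_limit_kph: float = 100.0
    requires_lane_markings: bool = True

def odd_violations(state, bounds):
    """Return the list of ODD parameters currently out of bounds.
    A non-empty result should trigger the specified response: request
    driver intervention or execute a minimal risk maneuver."""
    v = []
    if state["precip_mm_h"] > bounds.max_precip_mm_h:
        v.append("precipitation")
    if state["visibility_m"] < bounds.min_visibility_m:
        v.append("visibility")
    if state["speed_limit_kph"] > bounds.max_speed_limit_kph:
        v.append("speed_limit")
    if bounds.requires_lane_markings and not state["lane_markings_visible"]:
        v.append("lane_markings")
    return v
```

Note that each `state` input is itself an estimate produced by perception, so the monitor inherits the same uncertainty problem it is meant to guard — which is why it belongs in the requirements document rather than being bolted on later.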
SOTIF: A Framework for the Unknown
ISO 21448 (Safety of the Intended Functionality, SOTIF) is the most useful framework currently available for structuring safety arguments around ML-based systems. Its core insight is that a system can be built exactly as specified and still be unsafe — not because of a fault, but because the specification was incomplete.
SOTIF divides the hazard space into four zones defined by two axes: known/unknown and safe/unsafe. The engineering challenge is to progressively move scenarios from the unknown-unsafe zone (behaviors you haven’t anticipated) into the known-safe zone through testing, simulation, and monitoring.
For perception system requirements, SOTIF has several practical implications:
Requirements for scenario coverage. You must specify not just the scenarios your system must handle correctly, but the process by which you identify and evaluate scenarios you haven’t yet thought of. This is a process requirement, not a performance requirement, and it belongs in your requirements documentation.
Triggering conditions. SOTIF asks you to specify the conditions that could trigger unsafe behavior — the edges of competence of the perception system. These become requirements for ODD constraints and for the monitoring system.
Evaluation of residual risk. After mitigation, SOTIF requires a judgment that residual risk is acceptable. That judgment has to be supported by evidence, and the evidence has to be traced back to requirements. This is where a coherent traceability model matters — not as bureaucracy, but as the mechanism that makes your safety argument auditable.
SOTIF does not tell you exactly how to write requirements for a neural network. No standard does yet. What it does is give you a vocabulary for the argument structure, which is the minimum you need before requirements can be written coherently.
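The quadrant bookkeeping itself is simple enough to sketch. The scenario catalog and its two boolean fields are a deliberate oversimplification — a real program attaches analysis and test evidence to each triggering condition — but the shape of the accounting is the point:

```python
# Sketch of bookkeeping over the SOTIF quadrants. The catalog entries and the
# 'identified'/'mitigated' fields are hypothetical simplifications.
from collections import Counter

def sotif_zone(scenario):
    if not scenario["identified"]:
        return "unknown"  # residual unknown-safe / unknown-unsafe mass
    return "known_safe" if scenario["mitigated"] else "known_unsafe"

catalog = [
    {"id": "cut_in_low_sun",   "identified": True,  "mitigated": True},
    {"id": "ped_between_cars", "identified": True,  "mitigated": False},
    {"id": "estimated_rest",   "identified": False, "mitigated": False},
]

zone_counts = Counter(sotif_zone(s) for s in catalog)
# Engineering progress = mass moving out of "known_unsafe" and "unknown"
# into "known_safe" through testing, simulation, and monitoring.
```

The value of even this trivial model is that the unknown mass gets a line item: it appears in the tally instead of disappearing from the argument.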
Simulation-Based Testing and Coverage Requirements
Physical test driving cannot cover the ODD. The tail of rare but safety-critical scenarios — the child running between parked cars in rain at dusk, the mattress fallen from a truck on a highway — cannot be collected at statistically meaningful rates through on-road testing. This is not an opinion. It is arithmetic.
Simulation fills this gap, but simulation introduces its own requirements layer. You need to specify:
Scenario generation requirements. What parametric space must the simulation framework sample? What are the required densities of sampling in high-risk regions of that space? These are requirements on the test process, not on the system under test.
Fidelity requirements. At what level must the sensor simulation model the real sensor? A perception system validated only against a perfect sensor simulation is not validated against the real world. Fidelity requirements bound this gap and force it to be acknowledged.
Coverage metrics. Scenario coverage is not binary. A requirements document should specify coverage metrics — what fraction of the defined scenario space must be exercised, and with what sampling strategy — rather than leaving “sufficient simulation testing” as an undefined standard.
Transfer validation requirements. Any claim derived from simulation must be supported by evidence that simulation-validated performance transfers to real-world performance. This transfer gap is a known-unknown in SOTIF terms and must be explicitly addressed.
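One concrete way to make scenario coverage non-binary is a grid metric over the parametric space: discretize each parameter into bins and report the fraction of cells exercised. The parameters and bin edges below are hypothetical; a real program would derive them from the ODD specification:

```python
# Sketch of a grid-based coverage metric over a parametric scenario space.
# Parameter names and bin edges are illustrative only.
bins = {
    "ego_speed_kph":     [(0, 30), (30, 60), (60, 100)],
    "precip_mm_h":       [(0, 1), (1, 4)],
    "pedestrian_dist_m": [(0, 20), (20, 50), (50, 80)],
}

def cell_of(scenario):
    """Map a scenario to its grid cell; None marks a value outside the space."""
    def bin_index(value, edges):
        for i, (lo, hi) in enumerate(edges):
            if lo <= value < hi:
                return i
        return None
    return tuple(bin_index(scenario[k], edges) for k, edges in bins.items())

def coverage(scenarios):
    """Fraction of grid cells exercised by at least one in-space scenario."""
    total_cells = 1
    for edges in bins.values():
        total_cells *= len(edges)
    exercised = {c for c in map(cell_of, scenarios) if None not in c}
    return len(exercised) / total_cells
```

Uniform grids are the crudest option — risk-weighted sampling densities, as noted above, matter more in practice — but even this turns “sufficient simulation testing” into a number that can appear in a requirement.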
Simulation does not replace physical testing. It enables the physical testing program to focus on high-value confirmation rather than attempting impossible coverage.
Monitoring Requirements
A perception system that operates in a dynamic, open world requires ongoing monitoring requirements as part of its specification. These include:
Performance drift detection. The deployed model must be monitored for distribution shift — conditions in the real deployment that differ from the training distribution. This requires specifying the metrics, thresholds, and response procedures.
Anomaly detection. The system must characterize its own uncertainty and flag inputs that fall outside its competence envelope. This is a functional requirement on the perception architecture — not all architectures support it natively, and selecting one that does is a design decision driven by a requirement.
Safety event logging. Events approaching ODD boundaries or triggering uncertainty flags must be logged with sufficient fidelity to support post-hoc analysis. Log specification is a requirements task.
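For drift detection, one common heuristic is the population stability index (PSI) computed between a training-time reference histogram and a deployment histogram of a monitored input statistic. The statistic, bins, and alert threshold below are assumptions for illustration:

```python
# Sketch of distribution-shift monitoring via the population stability index
# (PSI) on a monitored input statistic (e.g., per-frame mean brightness).
# The bin fractions and the 0.2 alert threshold are illustrative choices.
import math

def psi(expected_fracs, observed_fracs, eps=1e-6):
    """PSI between two histograms over the same bins; values above roughly
    0.2 are a common heuristic for significant distribution shift."""
    return sum(
        (o - e) * math.log((o + eps) / (e + eps))
        for e, o in zip(expected_fracs, observed_fracs)
    )

reference  = [0.25, 0.50, 0.25]   # training-time distribution over bins
deployment = [0.10, 0.45, 0.45]   # observed distribution in the field

drift_alert = psi(reference, deployment) > 0.2
```

The requirements task is not choosing PSI specifically — it is specifying, in advance, which statistics are monitored, which thresholds fire, and what the response procedure is when they do.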
Monitoring requirements are often treated as an operations concern, separate from the engineering requirements process. For safety-critical autonomous systems, that separation is a mistake. Monitoring is part of the safety argument.
Keeping the Safety Case Coherent
The requirements types described above — performance envelopes, dataset specifications, ODD bounds, SOTIF-driven scenario coverage, simulation coverage metrics, monitoring obligations — do not naturally fit in a flat requirements document. They form a graph: ODD bounds constrain performance requirements, which depend on dataset requirements, which support simulation coverage claims, which flow up to system-level safety arguments.
This is where the tooling gap becomes concrete. Document-based requirements management tools were designed for the hierarchical decomposition of functional requirements. They are genuinely poor at expressing the dependency structure that makes an autonomy safety case coherent.
Teams working on autonomy programs have started migrating toward graph-based requirements management approaches that can express these dependencies explicitly. Flow Engineering is one of the platforms being used by autonomy teams for this purpose — specifically to trace performance requirements and ODD constraints through the system hierarchy, so that when a dataset version changes or an ODD boundary is revised, the impact on dependent requirements and safety claims is visible rather than buried in documents. For probabilistic systems where the safety argument depends on the coherence of multiple layers of evidence, that traceability model is not optional overhead. It is the mechanism that keeps the argument from quietly breaking.
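The impact analysis that graph-based tooling automates reduces to reachability over the dependency graph. The artifact IDs and edges below are hypothetical, but the traversal is the whole mechanism: change an upstream artifact, enumerate everything whose compliance status must be re-evaluated:

```python
# Sketch of impact propagation over a requirements dependency graph.
# Node IDs are hypothetical; edges point from an artifact to the
# requirements and claims that depend on it.
from collections import deque

depends_on_me = {
    "DATASET-v3":   ["PERF-PED-001"],
    "ODD-SPEC-001": ["PERF-PED-001", "MON-ODD-001"],
    "PERF-PED-001": ["SAFETY-CLAIM-7"],
    "MON-ODD-001":  ["SAFETY-CLAIM-7"],
}

def impacted(changed_artifact, graph):
    """Breadth-first reachability: everything downstream of a change."""
    seen, queue = set(), deque([changed_artifact])
    while queue:
        node = queue.popleft()
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen
```

In a document-based tool, this traversal is performed by a human reading cross-references; the graph-based approach makes it a query, which is what keeps it from being skipped.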
What This Means in Practice
There is no complete solution to the requirements problem for AI-enabled perception systems. The field is in active development, and any practitioner who tells you they have a fully settled approach is either working on a very constrained system or not being honest about the open questions.
What you can do:
- Accept that performance-based, dataset-linked requirements are the minimum viable specification for an ML perception system. Classical “shall” statements, used alone, are insufficient.
- Treat the ODD as a requirements artifact, not a marketing document. Version it, trace it, and maintain it with the same discipline as functional requirements.
- Use SOTIF to structure your safety argument, even if imperfectly. Having a vocabulary for unknown-unsafe scenarios is better than not having one.
- Build simulation-based testing into your requirements process from the start, with explicit coverage specifications rather than ad hoc test plans.
- Require monitoring as part of the system specification, not as an afterthought.
- Use tooling that can express graph-structured dependencies among requirements, not just hierarchical decomposition.
The engineers who get this right are not the ones who find a way to make ML systems fit the old requirements framework. They are the ones who extend the framework far enough to match the actual problem.