How to Write Requirements for Machine Learning Components

Machine learning components present a genuine specification problem for systems engineers. Traditional requirements methods were built around deterministic behavior: given input X, produce output Y. ML components don’t work that way. They produce probabilistic outputs, degrade in ways that depend on data distributions, and fail in modes that have no equivalent in rule-based software.

The answer is not to abandon rigorous requirements practice. The answer is to extend it. This guide covers five areas where ML components require different specification language than classical software: performance metrics, operational design domain, data requirements, uncertainty quantification, and failure mode specification. Each section defines the concept, explains why conventional approaches fall short, and provides concrete language patterns you can use.

This is not a guide to evaluating ML algorithms. It is a guide for systems engineers who must specify what an ML component must do — and constrain how it may fail — before handing off to an ML development team or integrating a third-party model.


Why Standard Requirements Methods Break Down for ML

A conventional software requirement reads something like: “The system shall compute trajectory deviation with a maximum latency of 50ms.” Deterministic. Verifiable. Binary pass/fail.

Apply that pattern to an ML component: “The object detection model shall detect pedestrians.” That statement is technically a requirement. It is also nearly useless for development, testing, or acceptance.

The problem has three roots:

Probabilistic behavior. An ML model’s output is learned from training data, not derived from explicit logic. Two models trained on the same data with different random seeds can assign different confidence scores to the same input. Performance is a property of a distribution of inputs, not of a single input-output pair.

Distribution dependency. An ML model performs relative to a data distribution. It can be highly accurate on training-adjacent data and catastrophically wrong on data it has never seen. Standard requirements don’t capture this dependency.

Degradation rather than failure. Traditional software either works or it doesn’t. ML components degrade. Accuracy drops. Confidence scores drift. These are not binary failure events — they are continuous shifts that the system must be designed to handle.

Each of these requires distinct requirements constructs.


1. Performance Metrics: Be Specific About What You’re Measuring

The most common mistake when specifying ML performance is treating accuracy as a scalar. “The classifier shall achieve 95% accuracy” means almost nothing without answers to three questions: accuracy on what dataset, under what conditions, and measured how?

Specify the metric, not just the number

Accuracy is one of many performance metrics, and for most safety-relevant applications, it is the wrong primary metric. Define the metric that matches the cost asymmetry of your application:

  • Precision and recall when false positives and false negatives have different consequences. A pedestrian detection system tolerates false positives far better than false negatives. State this asymmetry explicitly.
  • F1 or Fβ when you need to weight that asymmetry. Specify β.
  • AUC-ROC when the operating threshold will be tuned at integration time. Specify the minimum acceptable AUC and the acceptable operating point range.
  • Mean Average Precision (mAP) for detection tasks with multiple classes. Specify per-class floors, not just overall mAP.

Requirement language pattern:

“The pedestrian detection component shall achieve a minimum recall of 0.97 for the pedestrian class at a precision of no less than 0.85, evaluated on the program acceptance test dataset (ref: [dataset ID]).”

That requirement is verifiable. “95% accuracy” is not.
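A verifiable requirement implies a mechanical check. As a minimal sketch (function names and the toy labels are illustrative, not from any program artifact), the acceptance test for the pattern above reduces to computing the two metrics and comparing against the stated floors:

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for one class from parallel label lists."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def meets_requirement(y_true, y_pred, min_recall=0.97, min_precision=0.85):
    """Binary pass/fail against the stated floors -- the verifiability the pattern buys."""
    p, r = precision_recall(y_true, y_pred)
    return r >= min_recall and p >= min_precision
```

Note that "95% accuracy" admits no such check without first pinning down the dataset, the class, and the metric.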

Tie performance to operating conditions

ML performance is not constant across operating conditions. A vision model trained on daytime data degrades at dusk. A predictive maintenance model trained on steady-state operation may fail to generalize to startup transients.

Your performance requirements must be stratified by operating condition. Identify the conditions that matter to your system — lighting, sensor degradation, edge cases, environmental variation — and specify performance floors for each. Where you cannot yet fully characterize conditions, require that the ML development team characterize them during development and baseline performance per condition before acceptance.
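Stratified verification can be sketched directly, assuming each acceptance-test example carries a condition tag (the tag names and per-condition floors below are hypothetical):

```python
def stratified_recall(examples, floors):
    """Group test examples by condition tag and check recall against a per-condition floor.

    examples: list of (condition, y_true, y_pred) with binary labels;
    floors: {condition: min_recall}. Returns {condition: (recall, passed)}.
    """
    by_cond = {}
    for cond, t, p in examples:
        by_cond.setdefault(cond, []).append((t, p))
    results = {}
    for cond, pairs in by_cond.items():
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        recall = tp / (tp + fn) if tp + fn else 1.0
        results[cond] = (recall, recall >= floors.get(cond, 0.0))
    return results
```

A component can clear an overall floor while failing in exactly the stratum that matters; per-condition results make that visible.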


2. Operational Design Domain: Define the Envelope

The Operational Design Domain (ODD) is the set of conditions under which an ML component is expected to operate correctly. Anything outside the ODD is undefined behavior. This concept originated in autonomous vehicle development and belongs in every ML requirements set.

ODD is a requirements boundary, not a design note

Systems engineers sometimes treat ODD as a documentation artifact — a description of what the model was trained on. That is backwards. The ODD should be specified before development, as a constraint that bounds what the ML team must handle and what the system architecture must manage.

A well-specified ODD covers:

  • Input range constraints: sensor types, resolutions, sample rates, calibration states
  • Environmental parameters: temperature, lighting, weather, interference sources
  • Object/entity characteristics: size ranges, speeds, materials, configurations
  • Operational context: operational mode, time of day, geographic region, system health state

Requirement language pattern:

“The lane marking detection component shall perform within specification for roadway environments with horizontal visibility exceeding 80m, ambient light levels between 50 lux and 100,000 lux, and lane markings conforming to [standard reference]. Performance outside these conditions is not specified and shall trigger a degraded mode notification.”

The last sentence is critical. The ODD requirement must be paired with a system-level requirement specifying what happens when an ODD boundary violation is detected. The ML component alone cannot handle this; the system architecture must.
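The pairing of envelope and response can be sketched as a system-level check. The limits below are taken from the example requirement (visibility above 80 m, 50 to 100,000 lux); the parameter names and mode labels are hypothetical:

```python
# ODD envelope from the example requirement; keys and structure are illustrative.
ODD_LIMITS = {
    "visibility_m": (80.0, float("inf")),   # horizontal visibility floor
    "ambient_lux": (50.0, 100_000.0),       # specified lighting envelope
}

def check_odd(measurements):
    """Return the list of violated ODD parameters; empty means inside the envelope.

    A missing measurement counts as a violation -- unknown is outside the ODD.
    """
    violations = []
    for name, (lo, hi) in ODD_LIMITS.items():
        value = measurements.get(name)
        if value is None or not (lo <= value <= hi):
            violations.append(name)
    return violations

def system_mode(measurements):
    """Map ODD status to a system mode -- the system, not the model, owns this decision."""
    return "DEGRADED" if check_odd(measurements) else "NOMINAL"
```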

Specify ODD monitoring as a requirement

If operating outside the ODD is a safety concern, then detecting ODD boundary violations is a safety requirement. Require that the system — not just the model — include ODD monitoring capability. This may be implemented by the ML component (distribution shift detection), by independent sensor monitoring, or by operational constraints. That is an architecture decision. The requirement should be placed at the level that owns the safety argument.
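One simple form of ODD monitoring is a distribution-shift check on a monitored input feature. The sketch below flags drift when a recent window mean moves several training-set standard deviations from the training mean; real monitors use stronger statistical tests, and all names and thresholds here are assumptions for illustration:

```python
from collections import deque
import statistics

class DriftMonitor:
    """Crude distribution-shift check for one monitored input feature.

    Flags drift when the mean of a sliding window of recent values moves more
    than k training-set standard deviations from the training mean.
    """

    def __init__(self, train_mean, train_std, window=100, k=3.0):
        self.train_mean = train_mean
        self.train_std = train_std
        self.k = k
        self.window = deque(maxlen=window)

    def update(self, value):
        """Ingest one observation; return True if drift is currently flagged."""
        self.window.append(value)
        current = statistics.fmean(self.window)
        return abs(current - self.train_mean) > self.k * self.train_std
```

Whether this logic lives in the ML component, an independent monitor, or operational procedure is the architecture decision the requirement leaves open.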


3. Data Requirements: Treat Them as First-Class Artifacts

Data requirements are the most consistently neglected part of ML component specifications. They are frequently buried in development plans, left to the ML team’s discretion, or omitted entirely. This creates acceptance problems, audit failures, and in safety-critical programs, genuine risk.

Data requirements belong in the requirements baseline. They are not implementation details.

What data requirements must cover

Dataset composition: Define the class balance, demographic distribution, environmental coverage, and edge case representation required in training and validation data. If your model must work in three operating regions with different sensor configurations, require that training data represent all three.

Dataset provenance and quality: Specify labeling standards, acceptable error rates, chain-of-custody requirements, and exclusion criteria. A requirement that a model achieve 97% recall is meaningless if labels in the test set are wrong 5% of the time.

Test dataset independence: Require explicit separation between training data, validation data, and acceptance test data. The acceptance test dataset should be specified and controlled at the program level, not left to the ML development team.

Data volume floors: Set minimum dataset sizes for training, validation, and test sets. Derive these numbers from a statistical argument about the confidence interval on your performance requirements; do not choose them arbitrarily.
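One common form of that statistical argument is the normal-approximation sample size for estimating a proportion such as recall, sketched below (the specific expected value and interval width are illustrative inputs, not program numbers):

```python
import math

def min_samples(p_expected, half_width, z=1.96):
    """Normal-approximation sample size so that a proportion estimate (e.g. recall)
    has a confidence interval of +/- half_width; z = 1.96 gives ~95% confidence.

    n = z^2 * p * (1 - p) / half_width^2, rounded up.
    """
    return math.ceil(z**2 * p_expected * (1 - p_expected) / half_width**2)
```

For example, verifying a recall near 0.97 to within plus or minus one percentage point at 95% confidence requires on the order of 1,100 positive examples, which is how a floor like "no fewer than 500 labeled examples per fault class" should be justified or revised.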

Requirement language pattern:

“The acceptance test dataset for the fault classification component shall contain no fewer than 500 labeled examples per fault class, shall be collected from production units not used in training data collection, and shall be under configuration control as a program-level artifact from PDR onward.”


4. Uncertainty Quantification: Require It as a System Output

Most ML models produce a prediction. Well-specified ML components produce a prediction and an uncertainty estimate. The distinction matters enormously for system integration.

Why uncertainty quantification is a system requirement, not a model property

A pedestrian detection model that says “pedestrian, confidence 0.97” and one that says “pedestrian, confidence 0.54” carry different risk implications. If the system cannot distinguish between these outputs, the system cannot make appropriate decisions about when to defer to human judgment, when to reduce speed, or when to flag for review.

Uncertainty quantification is not a statistical nicety. It is an interface requirement between the ML component and the rest of the system.

Specify what the uncertainty output must represent: epistemic uncertainty (model uncertainty, high when the input is unlike the training distribution), aleatoric uncertainty (irreducible noise in the measurement), or a combined estimate. The right choice depends on what the system will do with it.

Specify calibration requirements: A well-calibrated model’s stated confidence of 0.9 should correspond to approximately 90% empirical accuracy at that confidence threshold. Require calibration testing against a held-out dataset and specify maximum calibration error.

Specify the output interface: Uncertainty estimates must be specified as interface outputs, not internal properties. Define the format, range, and update rate alongside the primary prediction outputs.

Requirement language pattern:

“The fault classification component shall output, for each classification, a calibration-corrected confidence score in the range [0,1]. The Expected Calibration Error (ECE) of confidence scores shall not exceed 0.05, evaluated on the program acceptance test dataset.”
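The ECE figure in that pattern is itself mechanically checkable. A minimal sketch of the standard binned computation (bin count and the toy data are illustrative):

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard binned ECE: weighted mean of |average confidence - accuracy| per bin.

    confidences: floats in [0, 1]; correct: parallel booleans (prediction right?).
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0 into last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece
```

A component that reports 0.9 confidence while being right half the time shows up immediately as a large ECE, which is exactly the failure the calibration requirement exists to catch.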


5. Failure Mode Specification: Require Bounded, Detectable Degradation

For deterministic software, failure mode analysis asks: what happens when this function returns an error? For ML components, the equivalent question is harder: what happens when the model’s outputs degrade in accuracy without producing any error signal?

This is the silent failure problem. A fault classifier that starts misclassifying anomalies as normal operation does not throw an exception. It silently produces wrong answers.

Specify degradation bounds, not zero-failure operation

Do not write requirements that imply ML components must never fail. Write requirements that bound how they may fail:

  • Specify maximum acceptable false negative rate under defined degradation conditions
  • Specify the acceptable latency between ODD violation and system response
  • Specify how the component must behave when it cannot produce a reliable output (safe default output, explicit uncertainty flag, handoff to fallback)

Require failure mode outputs as interface elements

The ML component should have specified behavior for its own high-uncertainty or out-of-distribution states. This is not asking the model to detect when it is wrong — that is architecturally impossible in general. This is requiring that the system architecture include monitoring capable of detecting ML component degradation, and that the ML component’s interface includes outputs that support that monitoring.

Requirement language pattern:

“When the output confidence score falls below [threshold] for three consecutive inference cycles, the fault classification component shall assert a DEGRADED_CONFIDENCE flag on its status output. The system shall respond to DEGRADED_CONFIDENCE by [specified behavior].”
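The pattern above implies a small piece of monitoring logic whose behavior is worth pinning down in the spec, including what clears the flag. A sketch, with the threshold, window, and recovery policy all placeholders to be specified per program:

```python
class ConfidenceMonitor:
    """Assert DEGRADED_CONFIDENCE after N consecutive low-confidence inference
    cycles, per the requirement pattern above. Threshold and window are placeholders."""

    def __init__(self, threshold=0.6, consecutive=3):
        self.threshold = threshold
        self.consecutive = consecutive
        self._low_run = 0
        self.degraded = False

    def update(self, confidence):
        """Ingest one cycle's confidence score; return current DEGRADED_CONFIDENCE state."""
        if confidence < self.threshold:
            self._low_run += 1
        else:
            self._low_run = 0
            self.degraded = False  # one possible policy: a good cycle clears the flag
        if self._low_run >= self.consecutive:
            self.degraded = True
        return self.degraded
```

The recovery behavior (does one good cycle clear the flag, or is operator action required?) is a genuine requirements decision; leaving it implicit invites divergent implementations.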


How Modern Tools Handle ML Component Requirements

The requirements tooling landscape has largely not caught up with the problem. Most legacy tools — IBM DOORS, Polarion, Jama Connect — handle ML component requirements as text blocks or attribute-laden objects that require significant process discipline to keep meaningful. They can store everything described in this guide, but they provide no structural support for the relationships between performance requirements, dataset specifications, ODD definitions, and failure mode criteria. Engineers end up maintaining those relationships manually, in spreadsheets or linked documents.

Flow Engineering (flowengineering.com) approaches this differently by treating requirements as nodes in a connected graph rather than rows in a database. For ML component requirements, this matters because the relationships are part of the specification: the performance requirement links to the acceptance dataset specification, which links to the data quality requirement, which links to the labeling standard. When any of these change, the graph makes the downstream impact visible immediately.

Flow Engineering also handles the AI-specific artifact types — ODD definitions, calibration requirements, uncertainty interface specs — as first-class requirement nodes rather than as unstructured text attributes bolted onto traditional requirement records. For teams building safety cases around ML components, the ability to trace from a system hazard through a failure mode requirement through an ODD boundary through an acceptance test is not a convenience. It is the argument.

That said, if your program already has deep investment in DOORS Next or Polarion and the requirement count is in the tens of thousands, migration cost is real and the ML-specific structuring can be implemented with disciplined attribute schemas in those tools. It requires more process overhead, but it is achievable.


Practical Starting Points

If you are beginning to specify ML components on a program that has not done this before, prioritize in this order:

  1. Define the ODD first. Everything else — performance metrics, data requirements, failure modes — is relative to the ODD. Without it, no other ML requirement is fully specified.

  2. Get the acceptance test dataset under configuration control early. Teams that wait until late in development find themselves accepting a test dataset the ML team designed to make the model look good. The program should own the acceptance dataset from PDR.

  3. Write uncertainty output requirements before the ML team begins interface design. Retrofitting uncertainty outputs onto a deployed model is expensive. Make it an interface requirement at the start.

  4. Require failure mode behavior at the system level, not the model level. The model cannot reliably detect its own failures. The system architecture must. Write the system-level requirements that make this architecture mandatory.

  5. Review your performance metrics with the person who will write the acceptance test. If the test engineer cannot describe exactly how they would verify the requirement, rewrite the requirement until they can.

ML components can be specified rigorously. The methods exist. The gap is usually not knowledge — it is the habit of applying traditional requirements patterns to a new kind of component where those patterns are insufficient. The five areas covered here give you the vocabulary and the structure to close that gap.