Can Machine Learning Models Be Verified Under Existing Aerospace Certification Standards?

This is one of the most consequential open questions in aviation safety right now. The answer is not obviously no. It is unresolved, the stakes are aircraft-level, and the industry is moving forward anyway. ML-based functions are already being fielded in non-safety-critical roles aboard certified aircraft, and they are being designed into safety-critical functions in programs currently in development. Certification authorities know this and are working on it, but the frameworks are incomplete and program schedules are not waiting for them.

Understanding where the gap is — technically, not just bureaucratically — is essential for any engineer working on systems that contain or interface with learned models.

What DO-178C and ARP4754A Were Designed to Verify

DO-178C (Software Considerations in Airborne Systems and Equipment Certification) and ARP4754A (Guidelines for Development of Civil Aircraft and Systems) were developed to answer a specific engineering question: how do you demonstrate that software and systems behave exactly as intended, under all conditions their designers anticipated?

The answer those standards provide is built on three pillars.

Traceability. Every requirement at the system level must trace to software requirements. Every software requirement must trace to code. Every line of code must trace back to a requirement that justified its existence. The chain runs both directions. An aircraft function that cannot be traced from system need to executable artifact is, by definition, not certifiable under these standards.

Determinism. Given the same inputs, a DO-178C-compliant software function must produce the same outputs, every time, on the target hardware it was verified on. This property is what makes requirements-based testing and structural coverage analysis meaningful: a passing test is repeatable evidence, and coverage analysis shows which parts of the code those tests actually exercised. Code that no requirements-based test reaches is either dead, insufficiently tested, or implementing behavior that no requirement asked for.

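Coverage-based verification. Structural coverage (statement coverage at DAL C, decision coverage at DAL B, MC/DC at DAL A) is the mechanism by which you argue that testing is complete. Tests are derived from requirements, and coverage analysis measures how much of the code structure those tests exercised. This is not the same as exercising every path: MC/DC requires showing that each condition independently affects its decision, which needs far fewer tests than exhaustive path enumeration. If every requirement has passing tests and the required coverage criterion is met, the verification argument is considered complete.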
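
To make the MC/DC criterion concrete, here is a minimal sketch in Python. The decision `a and (b or c)` is a hypothetical guard invented for illustration, not taken from any real avionics function, and the checker implements only the unique-cause form of MC/DC.

```python
from itertools import combinations


def decision(a: bool, b: bool, c: bool) -> bool:
    # Hypothetical guard condition, used only to illustrate the criterion.
    return a and (b or c)


def conditions_shown_independent(tests):
    """Return the indices of conditions whose independent effect is shown.

    Unique-cause MC/DC: for each condition there must be a pair of test
    vectors that differ only in that condition and flip the decision outcome.
    """
    shown = set()
    for t1, t2 in combinations(tests, 2):
        differing = [i for i in range(3) if t1[i] != t2[i]]
        if len(differing) == 1 and decision(*t1) != decision(*t2):
            shown.add(differing[0])
    return shown


# Four vectors (N + 1 for N = 3 conditions) are enough for this decision.
mcdc_tests = [
    (True, True, False),   # outcome True
    (False, True, False),  # outcome False; pairs with row 1 to show `a`
    (True, False, False),  # outcome False; pairs with row 1 to show `b`
    (True, False, True),   # outcome True; pairs with row 3 to show `c`
]

if __name__ == "__main__":
    print(sorted(conditions_shown_independent(mcdc_tests)))
```

The point of the sketch is that the criterion is defined over explicit conditions in the code. That is why it has a natural meaning for hand-written logic and no obvious counterpart for a matrix of learned weights.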

ARP4754A sits above DO-178C. It governs the system-level process: how functions are allocated, how failure conditions are analyzed (using safety assessment methods such as FHA, FMEA, and FTA, which are described in the companion document ARP4761), and how safety requirements flow down to hardware and software. The standard assumes that once requirements are properly derived and allocated, the software development process (DO-178C) can verify that those requirements are correctly implemented.

Both standards assume that the thing being verified is a function whose behavior is fully specified before implementation begins. The specification is the ground truth. The test is a comparison against the specification.

Why ML Models Break These Assumptions

A trained ML model does not have a specification in that sense. It has a training objective, a dataset, an architecture, and learned parameters: anywhere from thousands to billions of numbers, depending on the model, whose individual contributions to any given output are not interpretable at the level that DO-178C’s traceability requirements demand.

This creates specific, structural incompatibilities.

The determinism problem. A neural network with fixed weights, run on the same input on the same platform and numerical configuration, will produce the same output. In that narrow sense it is deterministic. But a neural network trained on different data, or with different random seeds, will produce a statistically similar but numerically different model. The “correct” output for a given input is not defined by a specification; it is approximated from the training distribution. Certification of conventional software assumes you can show the function is correct against its requirements. ML asks you to show the function is accurate enough, on the right distribution, with quantified uncertainty bounds. Those are different engineering tasks.

The traceability problem. You cannot trace the network’s output on a given input to a specific requirement that justified that output, because the model’s behavior emerges from the interaction of parameters that were learned, not specified. Structural coverage does not help: the inference code has few branches, so MC/DC can be satisfied almost trivially while saying nothing about the weights. The code that runs the model is small and traceable. The model itself, the artifact that actually produces behavior, is not.

The completeness problem. Structural coverage argues that requirements-based tests have exercised the whole code structure, and therefore all specified behavior. An ML model’s input space is effectively unbounded, and its behavior is encoded in weights rather than in branches. You cannot enumerate all inputs, and you cannot argue coverage in the DO-178C sense. You can argue statistical coverage over a test distribution, but that is a different argument, and the current standards have no mechanism for evaluating it.
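
One way to see how different the statistical argument is: a rough sketch of how many failure-free test samples would be needed to bound a failure probability by testing alone. It uses the standard zero-failure binomial bound, and it assumes the samples are independent and drawn from the true operational distribution, which is itself hard to demonstrate. The function name and the targets discussed afterwards are illustrative assumptions, not values from any standard.

```python
import math


def failure_free_samples_needed(max_failure_prob: float, confidence: float) -> int:
    """Samples needed, with zero failures observed, to claim that the true
    per-sample failure probability is below max_failure_prob at the given
    confidence level.

    Solves (1 - p)^n <= 1 - C for n. Assumes independent samples drawn
    from the operational distribution.
    """
    if not (0.0 < max_failure_prob < 1.0 and 0.0 < confidence < 1.0):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    # log1p keeps precision when max_failure_prob is very small.
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-max_failure_prob))


if __name__ == "__main__":
    for target in (1e-3, 1e-6, 1e-9):
        print(target, failure_free_samples_needed(target, 0.95))
```

Under these assumptions, a 10^-3 target at 95% confidence needs 2,995 failure-free samples, and a 10^-9 target needs roughly three billion. Testing alone does not scale to the most demanding targets, which is part of why the emerging frameworks lean on bounded operating envelopes and runtime monitoring rather than test volume.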

None of this means ML cannot be made safe in aviation. It means the existing verification machinery does not apply without modification. New concepts — and new engineering processes — are required.

What EASA and the FAA Are Actually Doing

Both authorities have acknowledged the problem and published provisional guidance. Neither has published an approved certification path for safety-critical ML.

EASA’s Concept Paper and AI Roadmap. EASA published its first concept paper on ML in aviation in 2021, with a second issue following in 2023. The papers define the concept of a Machine Learning Constituent (MLC) — the trained model treated as a distinct artifact within a system — and introduce the idea that certification must address the full ML lifecycle: data collection, training, validation, deployment, and monitoring. EASA’s AI Roadmap, updated through 2025, frames the certification of AI/ML in aviation as a multi-year program, acknowledging that current guidance is exploratory and that the pathway to approving DAL A or DAL B ML functions remains open.

The most significant conceptual contribution from the EASA work is its use of the Operational Design Domain (ODD), a concept borrowed from automated driving: the bounded set of conditions within which an ML model’s behavior has been characterized. The ODD is, in effect, a partial substitute for a specification. You don’t prove the model is correct for all inputs; you define the envelope of inputs for which you have demonstrated adequate performance, and you require that the operational system monitor whether it remains within that envelope.
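
As a rough illustration of what “monitor whether it remains within that envelope” can mean in practice, the following sketch checks an input sample against a declared ODD. The parameter names and ranges are hypothetical, and a real ODD would usually include conditions that are not simple numeric ranges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    low: float
    high: float

    def contains(self, value: float) -> bool:
        # A NaN value fails both comparisons, so it counts as out of range.
        return self.low <= value <= self.high


# Hypothetical ODD for an illustrative vision-based function.
ODD = {
    "altitude_ft": Range(0.0, 10_000.0),
    "visibility_km": Range(5.0, 50.0),
    "bank_angle_deg": Range(-30.0, 30.0),
}


def odd_violations(sample: dict) -> list:
    """Return the names of ODD parameters that are missing or out of range."""
    violations = []
    for name, allowed in ODD.items():
        value = sample.get(name)
        if value is None or not allowed.contains(value):
            violations.append(name)
    return violations


if __name__ == "__main__":
    print(odd_violations({"altitude_ft": 3000.0, "visibility_km": 2.0, "bank_angle_deg": 5.0}))
```

Treating a missing or non-numeric parameter as a violation is a deliberate choice in this sketch: if the system cannot establish that it is inside the envelope, it should not assume that it is.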

The FAA and project-specific approvals. The FAA addresses novel certification challenges mainly through project-specific instruments such as issue papers, special conditions, and equivalent level of safety findings, which allow an applicant to propose a means of compliance when standard methods are inapplicable. (Alternative Methods of Compliance, or AMOCs, are the analogous mechanism for Airworthiness Directives rather than for type certification.) Several programs are understood to be pursuing approvals of this kind for limited ML applications, but the approvals are program-specific and non-precedential: each program argues its own case. The FAA’s BEYOND initiative and ongoing standards work through RTCA are developing the longer-term framework, but published guidance applicable to safety-critical ML is not yet available.

Learning Assurance: Reframing Verification for ML

The term “learning assurance” has emerged — primarily from the EASA concept paper and subsequent industry work — to describe the structured argument that a trained ML model has adequate properties for its intended use. It is not a replacement for DO-178C; it is a parallel framework that addresses what DO-178C cannot.

Learning assurance addresses several concerns that conventional software verification does not:

Data assurance. The training dataset is, in a meaningful sense, the specification of an ML model’s behavior. Learning assurance requires that the dataset be characterized: its sources, its labeling process, its coverage of the ODD, its known biases, and its independence from the validation and test sets. Requirements on training data are first-class engineering requirements, not implementation details.
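
One small, checkable piece of data assurance is independence between training and test sets. The sketch below flags test records that also appear byte-for-byte in the training set. It is only a first-line check under that assumption: it will not catch near-duplicates, such as adjacent frames from the same flight, which need domain-specific handling. The function names are illustrative.

```python
import hashlib


def fingerprint(record: bytes) -> str:
    """Content hash used to compare records without holding them all in memory."""
    return hashlib.sha256(record).hexdigest()


def leaked_records(train, test):
    """Return the test records whose exact content also appears in training data."""
    train_hashes = {fingerprint(r) for r in train}
    return [r for r in test if fingerprint(r) in train_hashes]


if __name__ == "__main__":
    train_set = [b"frame-0001", b"frame-0002", b"frame-0003"]
    test_set = [b"frame-0003", b"frame-0100"]
    print(leaked_records(train_set, test_set))
```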

Model verification. Beyond accuracy metrics, learning assurance asks: does the model behave correctly at the boundaries of its ODD? Does its performance degrade gracefully as inputs approach ODD boundaries, or does it fail unpredictably? Are there known failure modes, and are they tolerable given the system safety analysis?

Monitoring requirements. Because an ML model cannot be fully verified at design time, learning assurance places obligations on runtime monitoring. The system must be able to detect when inputs are out-of-distribution, when model confidence is anomalously low, or when performance metrics collected in operation are diverging from design-time expectations. These monitoring requirements must be specified, traced, and verified — which puts them squarely in the domain of systems engineering, not just software engineering.
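
A minimal sketch of such a monitor is shown below. It combines an ODD flag, a per-sample confidence threshold, and a rolling rate of low-confidence outputs. The thresholds, window size, class name, and the three actions are all assumptions made for illustration; in a real system each would be derived from the safety analysis and captured as a requirement.

```python
from collections import deque
from enum import Enum


class Action(Enum):
    USE_OUTPUT = "use_output"
    ALERT = "alert"
    SAFE_STATE = "safe_state"


class RuntimeMonitor:
    """Illustrative monitor; thresholds are placeholders, not recommendations."""

    def __init__(self, min_confidence=0.8, window=50, max_low_conf_rate=0.2):
        self.min_confidence = min_confidence
        self.max_low_conf_rate = max_low_conf_rate
        self.recent_low_conf = deque(maxlen=window)

    def assess(self, in_odd: bool, confidence: float) -> Action:
        # Written so that a NaN confidence is treated as low confidence.
        low_conf = not (confidence >= self.min_confidence)
        self.recent_low_conf.append(low_conf)

        if not in_odd:
            return Action.SAFE_STATE

        # Only judge the rolling rate once the window is full.
        window_full = len(self.recent_low_conf) == self.recent_low_conf.maxlen
        rate = sum(self.recent_low_conf) / len(self.recent_low_conf)
        if window_full and rate > self.max_low_conf_rate:
            return Action.SAFE_STATE

        return Action.ALERT if low_conf else Action.USE_OUTPUT


if __name__ == "__main__":
    monitor = RuntimeMonitor(window=4, max_low_conf_rate=0.5)
    for in_odd, conf in [(True, 0.95), (True, 0.5), (False, 0.95), (True, 0.5), (True, 0.4)]:
        print(in_odd, conf, monitor.assess(in_odd, conf).value)
```

Note that the monitor itself is conventional, branch-based code. Unlike the model it supervises, it can be specified, traced, and verified with existing techniques, which is one reason monitoring carries so much weight in a learning assurance argument.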

The learning assurance framework reframes the verification question from “does this code do what its specification says?” to “can we make a structured, documented argument that this model is safe enough, within this envelope, for this function?” That argument must be traceable, auditable, and updated as the model or its operating environment changes.

Requirements Management as the Entry Point to Certifiable ML Engineering

The certification gap is real and unresolved at the standards level. But engineering teams developing ML-containing systems cannot wait for finalized guidance. Programs have schedules. Aircraft have in-service dates. The responsible path is to begin building the engineering rigor that a future certification argument will require, starting now — with requirements.

Capturing requirements on ML model behavior is the first and most structurally important step. These requirements are different from conventional software requirements, but they are requirements: they are falsifiable statements about what the model must do, under what conditions, with what performance, and with what monitoring.

Specifically, the requirements that belong in a managed, traced artifact set for an ML constituent include:

Performance envelope requirements: accuracy, precision, recall, or domain-specific metrics at minimum acceptable thresholds, on defined test distributions, within the ODD boundary.
ODD boundary requirements: explicit definitions of the input conditions under which the model is expected to perform, including environmental conditions, sensor ranges, and exclusions.
Training data requirements: constraints on the composition, labeling quality, class balance, and ODD coverage of training data — treated as allocated requirements, not implementation notes.
Robustness requirements: behavior under input perturbations, sensor degradation, or edge-case scenarios that are outside nominal training distribution but within operational possibility.
Monitoring requirements: what the operational system must observe, what thresholds trigger alerts or safe-state transitions, and what constitutes evidence of in-service performance degradation.
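
To show what “falsifiable” can look like for a performance envelope requirement, the sketch below checks a minimum-recall requirement against test results. Instead of comparing the raw recall to the threshold, it compares the lower end of a Wilson score interval, so that a small test set cannot satisfy the requirement by luck. The choice of the Wilson interval, the default z value of 1.96 (the two-sided 95% value), and the function names are assumptions of this example, not something the guidance prescribes.

```python
import math


def wilson_lower_bound(successes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    z2 = z * z
    centre = p + z2 / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials))
    return (centre - margin) / (1 + z2 / trials)


def recall_requirement_met(true_pos: int, false_neg: int, required_recall: float) -> bool:
    """True if the lower confidence bound on recall meets the requirement."""
    return wilson_lower_bound(true_pos, true_pos + false_neg) >= required_recall


if __name__ == "__main__":
    # Same observed recall (0.99), very different amounts of evidence.
    print(recall_requirement_met(990, 10, 0.98))
    print(recall_requirement_met(99, 1, 0.98))
```

Both calls have an observed recall of 0.99, but only the larger test set supports the claim that recall is at least 0.98. Writing the requirement in terms of a bound rather than a point estimate forces the test-set size into the requirement as well.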

Flow Engineering has been applied in this space by systems engineering teams who are, in effect, building the requirements architecture for a future certification argument before the certification path exists. The tool’s graph-based traceability model is well-suited to this problem because the relationships between these requirement types are not hierarchical — they are networked. A monitoring requirement traces to a performance requirement, which traces to an ODD boundary definition, which traces to a safety requirement from the system-level FMEA. In a document-based tool, those links are informal and fragile. In a graph model, they are first-class relationships that can be queried, validated, and updated as the engineering evolves.

What Flow Engineering brings to ML requirements specifically is the ability to treat training data constraints, ODD definitions, and monitoring obligations as traceable nodes in the same model that contains system-level safety requirements — so that when a system safety engineer changes a hazard classification, the downstream effects on model performance requirements and monitoring thresholds are visible and managed. That is the kind of rigor a learning assurance case will need to demonstrate.
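
The impact analysis described here can be sketched generically with a small directed graph. This is plain Python, not Flow Engineering’s API or data model, and the requirement IDs and links are hypothetical. Edges point from a requirement to the requirements derived from it.

```python
from collections import deque

# Hypothetical requirement set: parent -> requirements derived from it.
DERIVED_FROM = {
    "SAF-001 hazard classification": ["ODD-010 visibility envelope"],
    "ODD-010 visibility envelope": [
        "PERF-020 minimum recall",
        "DATA-030 training coverage",
    ],
    "PERF-020 minimum recall": ["MON-040 confidence threshold"],
}


def downstream(start: str, graph: dict) -> list:
    """Breadth-first list of every requirement affected by a change to `start`."""
    affected = []
    visited = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in visited:
                visited.add(child)
                affected.append(child)
                queue.append(child)
    return affected


if __name__ == "__main__":
    for req in downstream("SAF-001 hazard classification", DERIVED_FROM):
        print("review:", req)
```

A change to the hazard classification surfaces the ODD, performance, training data, and monitoring requirements that depend on it. In a document-based process, the same question has to be answered by reading.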

Honest Assessment

We are not close to a routine certification path for safety-critical ML in aviation. The standards work is real, the regulatory engagement is genuine, and the conceptual frameworks — learning assurance, ODD, ML constituents — are technically sound. But EASA’s own roadmap acknowledges that full certification guidance for high-DAL ML functions is years away, and the FAA’s approach remains case-by-case.

What is achievable now is an engineering process that is ready for certification when the framework arrives: requirements on ML behavior that are specific, traced, and managed; data and training processes that are documented and reproducible; monitoring designs that are specified as engineering requirements, not afterthoughts.

The teams that will be first through certification will be the ones who started treating their ML constituents as managed engineering artifacts — with requirements, traceability, and change control — before the certification path was paved. The tools and practices for that exist today. The standard that formally recognizes them does not yet.
