The Rise of AI-Augmented V&V: How Verification and Validation Is Changing in Safety-Critical Programs
Verification and validation in safety-critical programs has always been expensive in a very specific way: not primarily because testing infrastructure is costly, but because the cognitive labor of planning, documenting, reviewing, and defending test coverage is relentless. A single DO-178C Level A software component can require hundreds of test cases, each of which needs a documented rationale, a traceability link to requirements, and a record that a qualified engineer reviewed it. Multiply that across a modern avionics suite or an ADAS stack, and the overhead becomes the program.
That overhead is where AI is starting to make a measurable difference. Not by replacing the engineer who signs the certificate, but by compressing the hours between “we have requirements” and “we have defensible test coverage.” The change is real, it is uneven across industries, and it comes with genuine constraints that optimistic vendor marketing tends to understate.
This article examines where AI-augmented V&V is actually working, where the regulatory framework creates hard limits, and what teams operating under established safety standards need to understand before they change their processes.
What “AI-Augmented V&V” Actually Means
The phrase covers a wide range of capabilities, and conflating them leads to poor decisions. For practical purposes, there are four distinct application areas:
Test planning assistance. AI analyzes requirements—their structure, coverage intent, and relationships—and suggests which verification methods apply, which requirements cluster into logical test groups, and where the test plan has gaps relative to the standard being applied.
Test procedure generation. Given a requirement and context about the system, an AI model drafts test steps, expected conditions, pass/fail criteria, and traceability tags. Engineers review and accept or modify rather than authoring from scratch.
Coverage gap detection. AI examines the relationship between requirements and existing test cases, identifying requirements with no test coverage, test cases with no traceable requirement, and structural gaps in the coverage argument.
Results analysis and anomaly detection. AI processes test execution logs, flags anomalies relative to expected behavior, and surfaces potential failures that manual review of large result sets might miss.
These are different problems with different risk profiles. Test procedure generation feels impressive but sits entirely within the human review loop: an engineer still validates every word before it goes into the record. Results analysis is the area that warrants the most caution: if AI is making decisions about what constitutes a passing result, the human oversight model gets complicated fast.
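Of the four areas, coverage gap detection is the easiest to picture in code, because its core is a set comparison rather than a judgment. The sketch below is a minimal illustration under stated assumptions: the `CaseRecord` type, the `find_coverage_gaps` function, and the ID formats are hypothetical and not taken from any particular tool. It reports requirements with no test, tests with no valid requirement, and trace links that point at unknown requirements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaseRecord:
    """Hypothetical test case record: an ID plus the requirement IDs it claims to verify."""
    test_id: str
    verifies: frozenset


def find_coverage_gaps(requirement_ids, test_cases):
    """Return (untested requirements, orphan tests, dangling links)."""
    known = set(requirement_ids)
    covered = set()
    orphan_tests = []
    dangling = []  # (test_id, requirement_id) pairs that point at unknown requirements
    for tc in test_cases:
        valid = tc.verifies & known
        if not valid:
            orphan_tests.append(tc.test_id)
        for missing in sorted(tc.verifies - known):
            dangling.append((tc.test_id, missing))
        covered |= valid
    return sorted(known - covered), sorted(orphan_tests), dangling


# Illustrative data only.
requirements = ["SRS-001", "SRS-002", "SRS-003"]
tests = [
    CaseRecord("TC-10", frozenset({"SRS-001"})),
    CaseRecord("TC-11", frozenset({"SRS-001", "SRS-009"})),
    CaseRecord("TC-12", frozenset()),
]
untested, orphans, dangling_links = find_coverage_gaps(requirements, tests)
```

A check like this finds structural gaps only. Whether a linked test actually exercises the requirement is still a question for the reviewing engineer.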
Where AI-Augmented V&V Is Gaining Traction
Aerospace Under DO-178C
Software verification under DO-178C is structured enough that AI has good surface area to work with. The standard defines objectives for each software level, and those objectives map predictably to verification activities: requirements-based testing, structural coverage analysis, and independent review. That structure makes it possible to train or prompt models that understand what “MC/DC coverage at Level A” actually demands.
The practical wins are appearing in two places. First, test case authoring time. Teams using AI-assisted drafting report 40–60% reductions in first-draft authoring hours for requirements-based test cases. That does not mean 40–60% of the work disappears—review, negotiation with systems engineers, and DER interaction still take substantial time—but the cognitive load of staring at a requirement and translating it into structured test steps is reduced.
Second, bidirectional traceability verification. Traceability matrices in large avionics programs are notoriously hard to keep current. AI tools that crawl requirement sets, test case databases, and change logs can surface traceability breaks within hours that would previously require days of manual audit. This is unglamorous work, but a broken trace link is the kind of error that causes expensive late-cycle findings.
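One common kind of traceability break is a test case that was last reviewed against an older revision of its requirement. The sketch below shows one way such a check might look, assuming each trace link stores the requirement revision it was reviewed against. The `TraceLink` type, the integer revision scheme, and the function name are illustrative assumptions, not a description of any specific tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceLink:
    """Hypothetical trace link between one requirement and one test case."""
    requirement_id: str
    test_id: str
    reviewed_at_revision: int  # requirement revision the test was last reviewed against


def find_stale_links(current_revisions, links):
    """Return (link, reason) pairs for links that need re-review.

    current_revisions maps requirement ID to its current revision number.
    """
    stale = []
    for link in links:
        current = current_revisions.get(link.requirement_id)
        if current is None:
            stale.append((link, "requirement removed from baseline"))
        elif current > link.reviewed_at_revision:
            stale.append((
                link,
                f"requirement at rev {current}, "
                f"test reviewed at rev {link.reviewed_at_revision}",
            ))
    return stale


# Illustrative data only.
baseline = {"SYS-12": 3, "SYS-13": 1}
links = [
    TraceLink("SYS-12", "TC-40", 2),  # requirement changed since review
    TraceLink("SYS-13", "TC-41", 1),  # up to date
    TraceLink("SYS-99", "TC-42", 1),  # requirement no longer in baseline
]
findings = find_stale_links(baseline, links)
```

The output is a worklist for a human audit, not a verdict: a revision bump may or may not change what the test needs to demonstrate.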
ARP4754A, which governs system-level V&V for civil aerospace, is showing similar patterns at the system requirements level, particularly in the analysis of safety requirements allocation and the identification of derived requirements that lack clear test rationale.
Automotive Under ISO 26262
The automotive domain has a structural advantage in AI adoption for V&V: the prevalence of model-based development and simulation. ISO 26262-compliant programs are already generating enormous volumes of test results from hardware-in-the-loop (HIL) and software-in-the-loop (SIL) environments. AI applied to that result volume—detecting anomalies, classifying failure signatures, prioritizing which test results need engineer attention—has cleaner integration paths than in document-heavy aerospace workflows.
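Prioritizing which results need engineer attention does not require a model to decide pass or fail. As a deliberately simple stand-in for that kind of triage, the sketch below ranks results by how much of their tolerance band they consume, so that out-of-tolerance and near-margin results reach an engineer first. The `Result` fields, the 0.8 threshold, and the function names are assumptions made for illustration; a production tool would likely use richer signals than a single scalar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    """Hypothetical HIL/SIL measurement with its expected value and tolerance."""
    test_id: str
    measured: float
    expected: float
    tolerance: float  # allowed absolute deviation, must be > 0


def margin_used(result):
    """Fraction of the tolerance band consumed; above 1.0 means out of tolerance."""
    return abs(result.measured - result.expected) / result.tolerance


def triage(results, attention_threshold=0.8):
    """Return results at or above the threshold, most suspicious first."""
    flagged = [r for r in results if margin_used(r) >= attention_threshold]
    return sorted(flagged, key=margin_used, reverse=True)


# Illustrative data only.
run = [
    Result("HIL-1", 10.0, 10.0, 1.0),  # on target
    Result("HIL-2", 10.9, 10.0, 1.0),  # inside tolerance but close to the limit
    Result("HIL-3", 12.0, 10.0, 1.0),  # outside tolerance
]
review_queue = [r.test_id for r in triage(run)]
```

The function only orders the queue. The decision about whether a flagged result is a defect or a test environment artifact stays with the engineer.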
ADAS and autonomous driving programs are the most active adopters. The combination of high requirement volumes, frequent design changes, and intense scrutiny (from UN Regulation 157 as well as ISO 26262) is creating genuine demand for tools that can maintain coverage arguments through continuous change. Several Tier 1 suppliers have publicly described AI-assisted coverage gap analysis as part of their development workflows, though the tool qualification questions (ISO 26262 Part 8, tool confidence level) are still being worked through.
Defense
Defense programs operate under a patchwork of standards—MIL-STD-882E for system safety, DO-178C for airborne software, specific JCIDS and test agency requirements—and the acquisition cycle is long enough that toolchain changes are expensive. Adoption of AI-augmented V&V in defense is more selective, concentrated in programs where commercial aerospace or automotive toolchains are already present, or where new MOSA-aligned programs have more flexibility to specify tooling.
The area with the most traction is test planning for software-intensive subsystems, where the volume of requirements in a DOORS export or a requirements management database makes manual coverage analysis impractical at scale.
The Regulatory Acceptance Gap
Here is the honest version of where things stand: none of the major safety standards—DO-178C, ISO 26262, or ARP4754A—explicitly address AI as a verification participant. That creates an interpretive problem with real consequences.
Under DO-178C, verification tools that make pass/fail determinations on software need to be qualified under DO-330. An AI model that generates test cases is arguably a development tool, not a verification tool, and may not require full qualification—but the moment that model’s output becomes part of the evidence being submitted to a DER or certification authority, the “how do you know this is correct” question has to be answered. The current answer, in almost every program, is “a qualified engineer reviewed and approved it.” That answer is acceptable today, but it means AI is functioning as an authoring aid, not a verification participant.
ISO 26262 Part 8 provides a tool confidence level framework (TCL 1–3) that applies to tools used in safety-related activities. An AI-assisted test case generator affecting ASIL C or ASIL D software is going to trigger tool qualification questions that most AI vendors are not yet positioned to answer with validated evidence.
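The TCL itself comes from two classifications made by the project. Tool impact (TI) is TI1 when a malfunctioning tool cannot introduce or fail to detect an error in a safety-related item, and TI2 otherwise. Tool error detection (TD) runs from TD1, high confidence that a malfunction will be prevented or detected, to TD3, low confidence. The sketch below encodes that mapping as it is commonly summarized from Part 8. The function name is hypothetical, and the TI and TD values for any real AI tool are a project judgment that has to be argued, not computed.

```python
def tool_confidence_level(tool_impact, tool_error_detection):
    """Map TI (1 or 2) and TD (1, 2, or 3) to a TCL (1, 2, or 3)."""
    if tool_impact not in (1, 2):
        raise ValueError("tool_impact must be 1 or 2")
    if tool_error_detection not in (1, 2, 3):
        raise ValueError("tool_error_detection must be 1, 2, or 3")
    if tool_impact == 1:
        # No possible impact on the safety-related item: TCL1 regardless of TD.
        return 1
    # With TI2, the TCL follows the error detection class.
    return tool_error_detection


# Hypothetical classification: a test case generator whose every output is
# reviewed by a qualified engineer might be argued as TI2 with TD1.
reviewed_generator_tcl = tool_confidence_level(2, 1)

# The same generator with little downstream checking would be TI2 with TD3.
unchecked_generator_tcl = tool_confidence_level(2, 3)
```

The mapping shows why the human review step matters so much to the qualification argument: strong downstream detection is what keeps a tool with real impact out of the higher confidence levels.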
EASA has been more proactive than the FAA in publishing guidance on AI in aviation. Its AI Roadmap and its concept papers on machine learning applications are the clearest regulatory positions currently available, but even that guidance stops well short of qualifying AI as an independent verification mechanism.
The practical implication: AI-augmented V&V is operating in a space where it is tolerated and useful but not yet formally recognized. Programs need to structure their use of AI tools so that all certification evidence can be defended as the output of qualified human judgment, with AI in a supporting role. That is not a reason to avoid these tools—it is a reason to be precise about where they sit in the process.
What AI Cannot Yet Do in V&V
This deserves direct treatment, because the marketing around AI in safety-critical engineering tends toward enthusiasm.
AI cannot make test philosophy decisions. Deciding what constitutes adequate coverage for a novel failure mode—especially one without historical data—requires engineering judgment that AI does not possess. AI can tell you which requirements lack test cases. It cannot tell you whether the existing test cases actually probe the right failure conditions for a system behavior the development team is still discovering.
AI cannot replace the qualified independent reviewer. Independence requirements under DO-178C and ISO 26262 specify that, for defined objectives and integrity levels, verification must be performed by someone other than the author of the item being verified. AI tools trained on a project’s own data do not satisfy independence: they are, in a meaningful sense, extensions of the development process.
AI cannot reason reliably about physical system behavior it has not been exposed to. Test procedure generation works reasonably well for software requirements with clear behavioral semantics. It works poorly for system-level tests involving hardware integration, environmental conditions, or failure injection scenarios where the test setup itself requires domain expertise to design correctly.
AI results analysis cannot yet be trusted without human validation on safety-critical outputs. Pattern recognition on test logs is useful for prioritization and anomaly flagging. It is not a substitute for an engineer understanding why a result is anomalous and whether it represents a genuine defect or a test environment artifact.
Where Requirements Structure Determines AI Value
Teams getting meaningful results from AI in V&V share a common characteristic: their requirements are structured well enough for AI to parse. This sounds obvious, but it is the single largest determinant of whether AI tools deliver value or produce noise.
Requirements written as dense paragraphs with embedded rationale, assumptions buried in headers, and compliance-speak layered over actual system behavior are difficult for AI to analyze correctly. Requirements expressed with clear scope, measurable acceptance criteria, and explicit system states produce dramatically better AI-assisted test case drafts and more accurate coverage gap detection.
This is where graph-based requirements management is becoming a genuine differentiator. When requirements exist in a connected, structured model—with explicit relationships between stakeholder needs, system requirements, derived requirements, and verification methods—AI tools can traverse those relationships and produce analysis that is both more accurate and more explainable. Tools like Flow Engineering are built around this model-first architecture, giving AI assistants clean, queryable structure to work with rather than document contents scraped from Word or PDF exports. Teams using that kind of foundation report that AI-generated test suggestions have higher first-pass acceptance rates and require less correction, because the underlying requirements are semantically richer.
The contrast with legacy document-centric workflows is stark. Exporting requirements from a traditional tool into a flat spreadsheet and feeding that to an AI assistant is technically possible, but the result quality reflects the impoverished structure of the input.
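The difference a connected model makes can be shown with a small traversal. In the sketch below, requirements are nodes, derivation relationships are edges, and a breadth-first walk from a stakeholder need returns every downstream requirement that has no verification method, along with the path that led to it. The dictionaries, IDs, and function name are illustrative assumptions and do not reflect the data model of any particular product.

```python
from collections import deque

# Hypothetical model: each node maps to the requirements derived from it.
children = {
    "NEED-1": ["SYS-1", "SYS-2"],
    "SYS-1": ["DER-1"],
    "SYS-2": [],
    "DER-1": [],
}

# Hypothetical assignment of verification methods to requirements.
verification = {"SYS-1": "test", "DER-1": "analysis"}


def unverified_descendants(root, children, verification):
    """Walk downstream from root and return the trace path to each
    requirement that has no verification method assigned."""
    findings = []
    queue = deque([(root, [root])])
    seen = {root}
    while queue:
        node, path = queue.popleft()
        if node != root and node not in verification:
            findings.append(path)
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append((child, path + [child]))
    return findings


gaps = unverified_descendants("NEED-1", children, verification)
```

The returned path is what makes the finding explainable: a reviewer can see which need is left partly unverified and through which requirement. A flat spreadsheet export typically loses exactly that relationship.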
What Good Adoption Looks Like
Programs making productive use of AI in V&V tend to share several practices:
They define the human review checkpoint explicitly. Every AI-generated test case, coverage analysis, or anomaly flag has a defined review step, owned by a named role, before it enters the formal record. The AI output is evidence of what the AI suggested; the engineer’s approval is evidence that the requirement is actually verified.
They start with low-risk test types. Requirements-based test case drafting for non-safety-critical software, or structural coverage analysis at lower ASILs, generates value and builds team familiarity before the approach moves into DAL A or ASIL D territory.
They document their AI tool use in their plans. The Software Verification Plan or V&V Plan should describe what AI tools are used, in what role, and what human oversight applies. This is the answer to “how would you explain this to your DER or certification authority.”
They measure actual cycle-time impact. The teams with the most credible stories about AI in V&V are tracking hours-per-verified-requirement before and after AI introduction. The teams with the least credible stories are repeating vendor benchmarks.
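The first of these practices, the explicit review checkpoint, can be enforced in tooling rather than left to convention. The sketch below is one possible shape for that gate, assuming a simple in-memory record: an AI-generated draft cannot be admitted until a named role has approved it. The class names, fields, and the example role are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DraftArtifact:
    """Hypothetical AI-generated draft awaiting human review."""
    artifact_id: str
    content: str
    generated_by: str  # tool name and version, kept as provenance
    reviewer_role: Optional[str] = None
    approved: bool = False

    def approve(self, reviewer_role):
        if not reviewer_role:
            raise ValueError("approval requires a named reviewer role")
        self.reviewer_role = reviewer_role
        self.approved = True


class FormalRecord:
    """Hypothetical formal record that only admits approved artifacts."""

    def __init__(self):
        self._entries = {}

    def admit(self, artifact):
        if not artifact.approved:
            raise PermissionError(
                f"{artifact.artifact_id} has no recorded human approval"
            )
        self._entries[artifact.artifact_id] = artifact

    def __contains__(self, artifact_id):
        return artifact_id in self._entries


# Illustrative use: the draft is rejected until a named role approves it.
record = FormalRecord()
draft = DraftArtifact("TC-77", "Draft test steps", "drafting-assistant 0.1")
draft.approve("Verification Lead")
record.admit(draft)
```

Keeping `generated_by` next to `reviewer_role` preserves the distinction the practice depends on: what the tool suggested and who accepted it are both part of the record.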
Honest Assessment
AI-augmented V&V is not hype. The cycle-time reductions in test authoring and the accuracy of coverage gap detection are real, measurable, and already operating in programs under active certification. The tools are most effective when requirements are well-structured, when the human review model is explicitly defined, and when no one is asking AI to make the judgment calls that certification authorities will eventually scrutinize.
The limits are equally real. Regulatory frameworks are behind the technology, tool qualification for AI is unresolved, and the cognitive tasks that require genuine domain expertise—test philosophy, failure mode reasoning, independent judgment—remain human. That boundary is not going to move quickly, and programs that pretend it doesn’t exist are building arguments that will not survive a certification audit.
The productive path is narrower than the marketing suggests and wider than the skeptics allow: use AI to compress the labor-intensive authoring and auditing work, keep humans in every judgment-critical role, and build on requirements structures that give AI tools something coherent to analyze.