The Autonomous Vehicle Safety Case Crisis: Why AVs Still Cannot Be Certified at Scale

Waymo has logged tens of millions of driverless miles. Baidu Apollo Go operates robotaxi fleets in multiple Chinese cities. Cruise, before its 2023 regulatory suspension, was expanding commercially in San Francisco. By any headline measure, autonomous vehicles are real, operational, and accumulating experience.

Yet no fully autonomous vehicle—SAE Level 4 or above, without operational design domain restrictions that effectively substitute geography for a safety argument—has achieved broad regulatory certification in any major market. Not in the United States, where NHTSA's AV framework remains a voluntary guidance document. Not in the EU, where UNECE WP.29 has published framework regulations for automated lane keeping but stops well short of full autonomy. Not in China, where cities issue local permits but no national certification pathway for Level 4 exists.

This is not a story about technology failing to mature. The technology has matured considerably. This is a story about a certification methodology crisis—a fundamental gap between what safety regulators require to grant broad market approval and what AV developers are currently able to demonstrate.

The Probabilistic Argument Problem

The dominant safety certification paradigm in automotive and aviation is probabilistic. You define a hazardous event, assign it a severity level, and demonstrate that the system achieves a quantitative probability target—typically expressed as failures per operational hour. ISO 26262 and its derivatives use this framework extensively. It works well when failure modes are enumerable: hardware faults, software errors in deterministic code, sensor degradation. You can build fault trees, analyze FMEA tables, run accelerated life testing, and accumulate field evidence.
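
To see why the classical framework is tractable when failure modes are enumerable, consider how a fault tree composes them. The sketch below is illustrative only, with hypothetical failure rates and a standard rare-event simplification that treats hourly rates as probabilities over a one-hour exposure.

```python
# Hypothetical per-component failure rates, in failures per operational hour.
camera_fault  = 1e-6
radar_fault   = 2e-6
compute_fault = 5e-8

# AND gate: hazard only if both redundant perception channels fail in the
# same hour (rare-event approximation: rates treated as probabilities).
perception_loss = camera_fault * radar_fault

# OR gate (rare-event approximation): hazard if either branch occurs.
top_event = perception_loss + compute_fault

print(f"Top-event rate: {top_event:.2e} failures/hour")  # ~5.00e-08
```

Every term in that calculation exists because each failure mode was enumerated and assigned a rate. That is precisely the step that has no analogue for a neural network.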

AI-driven perception and decision systems break this framework at the foundation. The failure mode space is not enumerable. A neural network trained on hundreds of millions of labeled examples can fail on an input configuration that has never appeared in any training or validation set—not because of a hardware fault or a coding error, but because the input is genuinely novel. The system does not fail in a mode you can trace to a root cause and assign a rate. It fails because the world produced something the model had not encountered.

This is not a solvable problem within the current probabilistic paradigm. The mathematics require that you can either analytically bound the failure space or empirically estimate the failure rate from a representative sample. Neither condition holds for deep learning perception in open-world driving.

The most commonly cited quantification of this gap is the RAND Corporation's 2016 estimate that an AV fleet would need to drive approximately 275 million failure-free miles to demonstrate, with 95% confidence, a fatality rate lower than that of human drivers. Updated analyses, accounting for tail risks and rare-event distributions, have pushed this figure above 10 billion miles. Waymo's cumulative mileage is impressive by any AV industry standard. It is statistically negligible against this requirement.
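
The arithmetic behind the RAND figure is worth making explicit. The sketch below is a reconstruction of the standard calculation, not RAND's published model; it assumes fatalities occur independently at a constant per-mile rate, and asks how many failure-free miles are needed to rule out the human baseline rate at 95% confidence.

```python
import math

# Baseline: roughly the US human-driver fatality rate,
# ~1.09 fatalities per 100 million vehicle miles.
p = 1.09e-8          # fatalities per mile
confidence = 0.95

# Zero failures in n miles rules out a rate >= p at the given
# confidence when (1 - p)^n <= 1 - confidence. Solving for n:
n = math.log(1 - confidence) / math.log(1 - p)
print(f"Failure-free miles required: {n / 1e6:.0f} million")  # ~275 million

# The "rule of three" shortcut (n ~= 3/p) gives the same answer:
print(f"Rule-of-three estimate:      {3 / p / 1e6:.0f} million")
```

And this is the optimistic case: a single observed fatality, or any software change that resets the evidence base, pushes the requirement higher still.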

The SOTIF Challenge Nobody Has Solved

ISO 21448, the Safety of the Intended Functionality standard, was published in 2022 specifically to address the category of harm that occurs when a system operates exactly as designed but its performance limitations cause a hazardous situation. SOTIF captures the failure mode that pure fault analysis misses: not malfunction, but inadequacy.

A camera-based perception system that correctly identifies all objects it has been trained and validated to identify, but fails to detect a child on a micro-mobility device in an unusual configuration, is not malfunctioning under ISO 26262. It is performing within its design specification and failing under SOTIF. The distinction matters enormously for certification because the engineering response is different. You cannot fix a SOTIF failure by improving hardware reliability or increasing software rigor—you have to expand the system’s competence boundary, and competence boundaries for open-world perception are not definable by any current method.

ISO 21448 describes a process for identifying and reducing SOTIF risk: trigger condition analysis, scenario testing, validation coverage expansion. What it does not specify is what completion looks like. There is no acceptance criterion in the standard—no quantitative threshold at which a SOTIF analysis is considered sufficient for certification. The standard explicitly acknowledges this gap, noting that guidance for complex sensors and machine learning applications requires further development.

Regulators have inherited this ambiguity. NHTSA’s current AV guidance references SOTIF concepts but provides no certification pathway based on them. The EU’s UNECE WP.29 work on automated driving includes SOTIF references but has not resolved the acceptance criterion problem. Until there is a defined endpoint, developers cannot complete a SOTIF argument, and regulators cannot approve one.

Why Disengagement Data Tells You Almost Nothing

California’s DMV disengagement reporting requirement has generated the closest thing the industry has to a public safety metric. Every permitted AV operator files annual reports documenting the number of miles driven and the number of times a human safety driver took control. These reports are widely cited, analyzed, and ranked. Waymo consistently leads. Cruise and Zoox trail further behind. Commentators use them as a proxy for system maturity.

Disengagement data measures operator behavior and system confidence calibration, not safety outcomes. When a safety driver disengages in response to a perception anomaly, they may be preventing a safety-critical event—or they may be taking over because the system hesitated at a complex merge. Companies with more conservative takeover thresholds produce more disengagements. Companies that tune their systems to project confidence and avoid operator intervention produce fewer disengagements, with no necessary relationship to actual safety performance.
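
A toy simulation makes the confound concrete. In the deliberately simplified model below, with all numbers hypothetical, two fleets face an identical underlying hazard rate but set different takeover thresholds; their disengagement counts differ by two orders of magnitude while their hazard counts do not.

```python
import random

# "anomaly_score" is a hypothetical per-mile measure of how uncertain the
# driving stack looked. A true hazard occurs at the same rate in both
# fleets, independent of the takeover threshold.
random.seed(0)
MILES = 100_000
TRUE_HAZARD_RATE = 1e-4  # identical underlying risk for both fleets

def simulate(takeover_threshold: float) -> tuple[int, int]:
    disengagements = hazards = 0
    for _ in range(MILES):
        if random.random() > takeover_threshold:
            disengagements += 1  # driver takes over; this is what gets reported
        if random.random() < TRUE_HAZARD_RATE:
            hazards += 1         # actual safety-relevant event
    return disengagements, hazards

for name, threshold in [("conservative fleet", 0.99), ("confident fleet", 0.9999)]:
    d, h = simulate(threshold)
    print(f"{name}: {d} disengagements, {h} hazards over {MILES:,} miles")
```

Ranking these two fleets by disengagement rate would declare the confident fleet a hundred times more mature, on identical safety performance.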

The Cruise pedestrian dragging incident in October 2023 illustrated this precisely. The vehicle that struck and dragged a pedestrian had a disengagement profile consistent with a well-performing system. The disengagement metric had no predictive value for the specific failure that occurred. NHTSA’s subsequent investigation focused on sensor interpretation and post-crash behavior—neither of which appears in disengagement reports.

A credible safety metric for an AV system would require at minimum: a defined operational design domain with sharp boundaries, a validated hazard taxonomy, and outcome tracking against that taxonomy normalized by exposure. No major market currently requires this.
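
No such regime exists yet, but the shape of its data is easy to sketch. The structure below is hypothetical; the taxonomy IDs and field names are invented for illustration. The point is that every outcome is attributed to a taxonomy category and normalized by exposure inside the claimed ODD.

```python
from dataclasses import dataclass

@dataclass
class HazardOutcome:
    taxonomy_id: str   # category from a validated hazard taxonomy (invented here)
    events: int        # outcomes observed in this category
    odd_miles: float   # exposure: miles driven inside the claimed ODD

    @property
    def rate_per_million_miles(self) -> float:
        return 1e6 * self.events / self.odd_miles

outcomes = [
    HazardOutcome("VRU-03: unprotected crossing", events=2, odd_miles=4.1e6),
    HazardOutcome("MRG-11: high-speed merge conflict", events=7, odd_miles=4.1e6),
]
for o in outcomes:
    print(f"{o.taxonomy_id}: {o.rate_per_million_miles:.2f} events / 1M ODD miles")
```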

Three Incompatible Regulatory Frameworks

The US approach is permissive and fragmented. Federal jurisdiction covers vehicle standards; state jurisdiction covers licensing and operational permits. NHTSA has repeatedly declined to issue prescriptive AV safety standards, citing the pace of technology change and the risk of locking in premature requirements. The result is that a company can operate commercially in states with permissive regimes—California, Texas, Arizona—while the federal framework that would enable broad market access remains a voluntary guidance document. AV 3.0, AV 4.0, and NHTSA's AV TEST (Automated Vehicle Transparency and Engagement for Safe Testing) initiative have generated comment periods and stakeholder engagement without producing mandatory requirements.

The EU approach is process-based and increasingly detailed. The UNECE WP.29 Working Party on Automated/Autonomous and Connected Vehicles (GRVA) has produced regulations covering specific automated functions—ALKS for motorway use, specific remote driving frameworks—with an architecture that assumes type approval by function rather than by vehicle. The EU AI Act, which applies to high-risk AI systems and explicitly covers AI used in vehicles, introduces additional requirements for transparency, human oversight, and conformity assessment that are not aligned with the automotive type approval process. A Level 4 AV developer seeking EU market access must navigate WP.29 vehicle regulations, EU AI Act obligations, and member-state operational licensing—three partially overlapping frameworks with different competent authorities and different evidence requirements.

China's approach is city-led and government-integrated. Beijing, Shanghai, Wuhan, and Shenzhen have issued operating permits for Level 4 robotaxis through local road testing and commercial pilot programs. The national standard GB/T 40429 defines automation levels, and a national access management pilot program for intelligent connected vehicles was launched in 2023, but the pathway from local permit to national certification remains under construction. The practical effect is that Chinese AV companies—Baidu, WeRide, AutoX—can operate at commercial scale within permitted cities and have done so, but a unified national safety case framework that would generalize across markets does not exist.

A safety case accepted by California’s DMV provides no leverage in the EU type approval process. Evidence packages built to satisfy UNECE WP.29 ALKS requirements are irrelevant to the Chinese national standard review. Any company seeking genuinely global market access must construct parallel safety cases for parallel frameworks—a massive duplication of effort that also surfaces the underlying problem: there is no consensus on what constitutes a valid safety argument for these systems.

What a Credible Safety Case Would Actually Require

Practitioners working on this problem—teams at Waymo, Mobileye, TÜV SÜD, and parts of the safety engineering organizations at Aptiv and Bosch—have converged on several components that any credible Level 4 safety case must contain, regardless of which regulatory framework it targets.

A formally bounded Operational Design Domain. Not a prose description. A machine-readable, auditable specification of the conditions under which the system is claimed to be safe, with explicit exclusion criteria. Speed ranges, road type classifications, weather and lighting parameters, traffic density bounds. The ODD must be tight enough to be testable and specific enough that a regulator can determine whether the vehicle was operating within it at the time of any incident.
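
As a sketch of what "machine-readable" means here, consider the fragment below. The parameter names and values are hypothetical rather than drawn from any published schema (ISO 34503 defines a real ODD taxonomy for this purpose); what matters is that the specification is executable, so in-ODD status can be checked mechanically for any recorded vehicle state.

```python
# Hypothetical ODD fragment: explicit bounds and exclusion criteria.
ODD = {
    "speed_mph": (0, 45),
    "road_type": {"surface_street", "arterial"},   # freeways excluded
    "weather":   {"clear", "light_rain"},          # snow and fog excluded
    "lighting":  {"daylight", "street_lit_night"},
}

def odd_violations(state: dict) -> list[str]:
    """Return the ODD clauses the current vehicle state violates."""
    violations = []
    lo, hi = ODD["speed_mph"]
    if not lo <= state["speed_mph"] <= hi:
        violations.append("speed_mph")
    for clause in ("road_type", "weather", "lighting"):
        if state[clause] not in ODD[clause]:
            violations.append(clause)
    return violations

# In bounds: returns []. Swap "weather" to "fog" and it returns ["weather"].
print(odd_violations({"speed_mph": 38, "road_type": "arterial",
                      "weather": "light_rain", "lighting": "daylight"}))
```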

A system-theoretic hazard analysis. STPA (System-Theoretic Process Analysis) or equivalent approaches that can identify unsafe control actions in complex sociotechnical systems—not just component failure chains. FMEA alone is insufficient for systems where hazards emerge from interactions between components that are individually functioning correctly.
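
A minimal sketch of the STPA step that generates candidate unsafe control actions: each control action is crossed with STPA's standard guide phrases, and each resulting candidate is then reviewed against the system hazards. The control actions listed are hypothetical examples, not a real analysis.

```python
from itertools import product

# Hypothetical control actions for an AV planning layer.
control_actions = ["apply_brake", "initiate_lane_change", "yield_to_pedestrian"]

# STPA's standard guide phrases for unsafe control actions.
guide_phrases = [
    "not provided when needed",
    "provided when unsafe",
    "provided too early, too late, or out of sequence",
    "stopped too soon or applied too long",
]

# Each pairing is a candidate UCA; in a real analysis each one is
# reviewed against the system hazards and traced to causal scenarios.
for action, phrase in product(control_actions, guide_phrases):
    print(f"UCA candidate: '{action}' {phrase}")
```

Note that nothing in this enumeration requires a component to fail: the hazards it surfaces are unsafe interactions between correctly functioning parts, which is exactly what FMEA cannot see.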

Scenario-based validation with coverage arguments. Not just miles accumulated, but structured evidence that the scenario space relevant to the ODD has been systematically identified, prioritized by risk, and tested to defined pass/fail criteria. This requires a scenario taxonomy, a sampling methodology, and a coverage argument explaining why the tested scenarios are representative. This is the area where current approaches are most visibly inadequate—most validation strategies are either exhaustive for a narrow ODD or statistically unjustifiable for a broad one.
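
A coverage argument in this style might be summarized as below. The scenario classes, risk weights, and counts are invented for illustration; the structural point is that the claim is made per risk-weighted scenario class, never per raw mile.

```python
# Hypothetical taxonomy: class -> (risk_weight, scenarios_required, scenarios_passed)
taxonomy = {
    "cut-in, high speed delta":        (0.30, 400, 400),
    "occluded pedestrian crossing":    (0.25, 600, 512),
    "unprotected left across traffic": (0.20, 300, 300),
    "emergency vehicle interaction":   (0.15, 200, 121),
    "degraded lane markings":          (0.10, 150, 150),
}

# Coverage is claimed per risk-weighted class, with open gaps made explicit.
coverage = sum(w * min(passed / req, 1.0) for w, req, passed in taxonomy.values())
print(f"Risk-weighted scenario coverage: {coverage:.1%}")

for name, (w, req, passed) in taxonomy.items():
    if passed < req:
        print(f"  open gap: {name} ({passed}/{req} scenarios passed)")
```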

Behavioral competency certification for the AI stack. Evidence that the perception, prediction, and planning modules meet defined performance requirements within the ODD—including performance under distributional shift and adversarial conditions. This is not solved by overall system testing. The AI components require separate competency characterization.
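
A sketch of what per-module competency characterization looks like for a perception function, with hypothetical numbers throughout: aggregate recall can look healthy while the slice that matters most fails its acceptance floor.

```python
# Hypothetical pedestrian-detection results, sliced by ODD condition:
# slice -> (true_positives, false_negatives)
slices = {
    "daylight, clear":                 (97_900, 1_100),
    "night, street-lit":               (9_750, 250),
    "light rain, dusk":                (1_960, 40),
    "child-sized, partially occluded": (430, 70),
}
REQUIRED_RECALL = 0.97  # hypothetical per-slice acceptance floor

total_tp = sum(tp for tp, _ in slices.values())
total_fn = sum(fn for _, fn in slices.values())
print(f"Aggregate recall: {total_tp / (total_tp + total_fn):.3f}")  # ~0.987: looks fine

for name, (tp, fn) in slices.items():
    recall = tp / (tp + fn)
    verdict = "PASS" if recall >= REQUIRED_RECALL else "FAIL"
    print(f"  {name}: recall={recall:.3f} [{verdict}]")  # occluded-child slice fails
```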

A living safety case with change impact tracking. AV software stacks update continuously. A static safety case document becomes invalid the moment the system changes. A credible safety case must be structured as a living argument with explicit links between claims, evidence, and the system elements they support—so that any change triggers an automated assessment of which safety claims require re-substantiation.

This last requirement is where tooling matters. Teams still managing safety cases in document management systems—maintaining requirements in DOORS exports, arguments in Word documents, evidence in disconnected test databases—cannot produce a living safety case with change impact tracking at scale. The maintenance burden alone defeats the purpose. Organizations making the most progress have moved to model-based approaches where hazard analyses, requirements, and validation evidence exist as connected graph structures with machine-readable traceability.
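
The mechanism is simple to sketch. In the toy graph below, with hypothetical node names, requirements, claims, and evidence are nodes, and change impact becomes a graph traversal rather than a manual document review; real tools persist this structure in a model or graph database rather than in-memory dictionaries.

```python
from collections import deque

# Toy dependency graph: element -> safety-case elements that rely on it.
edges = {
    "REQ-142: max braking latency 150 ms": ["CLAIM-8: AEB adequate within ODD"],
    "TEST-301: HIL braking suite":         ["CLAIM-8: AEB adequate within ODD"],
    "CLAIM-8: AEB adequate within ODD":    ["CLAIM-1: vehicle safe within ODD"],
}

def impacted(changed: str) -> list[str]:
    """Breadth-first walk: everything downstream needs re-substantiation."""
    seen, queue = [], deque([changed])
    while queue:
        for dep in edges.get(queue.popleft(), []):
            if dep not in seen:
                seen.append(dep)
                queue.append(dep)
    return seen

print(impacted("REQ-142: max braking latency 150 ms"))
# -> ['CLAIM-8: AEB adequate within ODD', 'CLAIM-1: vehicle safe within ODD']
```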

Tools like Flow Engineering, built specifically for hardware and systems teams working with complex requirement graphs, enable this kind of connected, AI-assisted traceability—where a change to a system requirement automatically surfaces which safety claims downstream are affected and what re-validation is required. This is not a marginal workflow improvement. For a complex AV safety case with thousands of requirements and hundreds of claim-evidence links, the difference between graph-based and document-based management is the difference between a safety case that can be maintained and one that cannot.

Which Organizations Are Closest

Waymo operates the most mature publicly documented safety program in the Western market. Their 2023 Waymo One safety report describes a structured safety case framework, ODD documentation, behavioral safety principles, and a multi-layer validation approach. Crucially, they have published specific quantitative comparisons of Waymo Driver injury rates against human baseline rates in comparable conditions in Phoenix and San Francisco—the closest any company has come to the kind of outcome-normalized safety metric that a serious certification would require. Their approach is not complete by any rigorous standard, but it is further developed than any competitor's.

Mobileye’s RSS (Responsibility-Sensitive Safety) framework provides a formal, mathematically defined model for safe driving behavior that could in principle support a certifiable safety argument for specific functions. Its limitation is that it addresses the planning layer’s behavioral correctness, not the perception layer’s competency—which is where most hard certification problems live.
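
RSS's best-known rule, the minimum safe longitudinal following distance, illustrates what "mathematically defined" means here. The sketch below follows the formula published in the 2017 RSS paper by Shalev-Shwartz et al.; the parameter values are illustrative defaults, not Mobileye's calibrated ones.

```python
def rss_safe_longitudinal_distance(
    v_rear: float,            # rear (ego) vehicle speed, m/s
    v_front: float,           # lead vehicle speed, m/s
    rho: float = 0.5,         # ego response time, s (illustrative)
    a_accel_max: float = 2.0, # worst-case ego acceleration during rho, m/s^2
    a_brake_min: float = 4.0, # minimum braking the ego guarantees, m/s^2
    a_brake_max: float = 8.0, # hardest braking the lead car might apply, m/s^2
) -> float:
    """Minimum following distance under RSS's worst-case assumptions."""
    v_resp = v_rear + rho * a_accel_max  # ego speed after the response time
    d = (v_rear * rho
         + 0.5 * a_accel_max * rho ** 2
         + v_resp ** 2 / (2 * a_brake_min)    # ego stopping distance after responding
         - v_front ** 2 / (2 * a_brake_max))  # lead car's shortest stopping distance
    return max(d, 0.0)

# Ego at 20 m/s behind a lead car at 15 m/s:
print(f"{rss_safe_longitudinal_distance(20.0, 15.0):.1f} m")  # ~51.3 m
```

The formula is fully auditable, which is exactly why it is certifiable in principle. But it presumes the perception layer has already delivered correct speeds and positions, which is the presumption certification cannot currently discharge.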

In China, Baidu’s Apollo platform has the broadest operational deployment and has worked most extensively with municipal regulators to establish safety validation frameworks. Their experience has influenced the national standard development process. Whether this translates to a globally credible safety case is a different question.

The Honest Assessment

AV technology is real and, within narrow operational design domains and favorable operating conditions, demonstrably safer than average human drivers in specific tasks. That is genuine progress and should not be minimized.

The certification crisis is real. The probabilistic framework that regulators understand and trust does not apply to AI perception systems operating in open-world conditions. The SOTIF standard that addresses the right problem has not specified what resolution looks like. The primary public safety metric—disengagement data—measures the wrong thing. And three major regulatory frameworks are diverging rather than converging.

Progress will come from two directions simultaneously: regulators developing acceptance criteria that are demanding but tractable, and developers building safety cases that can actually be maintained, updated, and audited as living documents. Neither direction is moving fast enough. But the organizations that get there first will have one thing in common: they will have abandoned the document-centric safety case model that the industry inherited from aerospace programs that were designed, built, and certified once. AV safety cases are continuous engineering artifacts. The tools and methods that treat them as anything else are the wrong tools.