Engineering AI-Enabled Diagnostics Under the FDA’s Evolving Framework
The FDA’s regulatory posture on AI-enabled medical devices has shifted from cautious observation to active framework-building. The 2021 action plan for AI/ML-based Software as a Medical Device, the predetermined change control plan guidance, and the proposed regulatory structure for continuously learning AI systems represent a genuine architectural change in what the FDA expects from a compliant engineering process — not just additional submission documentation.
Most medical device engineering teams are treating this as a regulatory affairs problem. It isn’t. It’s an engineering process problem that regulatory affairs can’t solve without engineering documentation that doesn’t currently exist at most organizations.
This article focuses on what the FDA is actually asking for at the engineering level: how requirements must be structured for algorithms that retrain post-market, how V&V planning has to change for adaptive systems, and what submission packages for Class II AI/ML SaMD actually need to contain from an engineering process standpoint.
What the FDA’s Framework Actually Demands Structurally
The FDA’s traditional device framework was built around a fixed specification reviewed once at 510(k) or PMA. The predicate device logic works when the device you submit is the device that ships and remains unchanged. AI diagnostics break this assumption because the clinical algorithm — the part that matters most for patient safety — is expected to change over time.
The FDA’s response was the Predetermined Change Control Plan (PCCP). A PCCP is not a regulatory waiver; it’s a change management structure that the FDA reviews and authorizes in advance. It defines:
- The types of modifications the algorithm may undergo (retraining on expanded data, architectural updates, threshold adjustments)
- The performance boundaries within which those modifications are permissible without a new submission
- The testing and monitoring protocols that confirm a change stayed within bounds
- The transparency mechanisms that communicate algorithm updates to users
From an engineering standpoint, a PCCP is a living requirements envelope. You’re not specifying what the algorithm will do; you’re specifying the boundaries within which it’s allowed to evolve, the conditions under which it can evolve, and how you’ll verify that any given evolution stayed within those boundaries.
This is fundamentally different from how requirements are written for hardware or even for deterministic software. Traditional requirements are closed statements: the device shall detect ST-elevation with a sensitivity ≥ 92% under condition X. A PCCP-compatible requirement layer has to accommodate statements like: the algorithm’s sensitivity on the primary validation cohort shall not decrease by more than 2.5 percentage points following any retraining cycle, as measured against the locked reference dataset under protocol V-001.
The difference is subtle but structurally significant. The first requirement describes a threshold. The second describes a constraint on change — and that constraint has to be traceable through your entire V&V structure.
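The contrast between the two requirement forms can be sketched in a few lines of Python. This is an illustration only: the class names, the dictionary layout, and the example numbers are assumptions made for this article, not an FDA-defined schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdRequirement:
    """Closed requirement: the metric must meet a fixed floor."""
    metric: str
    minimum: float

    def is_met(self, candidate: dict) -> bool:
        return candidate[self.metric] >= self.minimum


@dataclass(frozen=True)
class ChangeConstraint:
    """PCCP-style requirement: bounds how far a metric may drop after retraining."""
    metric: str
    max_decrease: float  # absolute, e.g. 0.025 = 2.5 percentage points
    protocol_id: str     # protocol under which both models are measured

    def is_met(self, baseline: dict, candidate: dict) -> bool:
        return baseline[self.metric] - candidate[self.metric] <= self.max_decrease


def within_envelope(baseline, candidate, floors, constraints) -> bool:
    """A retrained model must satisfy every floor and every change constraint."""
    return (all(f.is_met(candidate) for f in floors)
            and all(c.is_met(baseline, candidate) for c in constraints))


# Hypothetical values echoing the examples in the text.
floors = [ThresholdRequirement("sensitivity", 0.92)]
constraints = [ChangeConstraint("sensitivity", 0.025, "V-001")]
baseline = {"sensitivity": 0.95}  # locked baseline on the reference dataset
```

Note that a candidate can clear the fixed floor and still violate the change constraint, which is why the two requirement types need separate trace links into V&V.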
What Requirements Look Like for a Retrained Algorithm
Engineering teams that have built requirements for deterministic Class II devices typically organize their requirements documentation around three layers: system requirements, software requirements, and software design specifications. This structure doesn’t map cleanly onto an AI-enabled diagnostic with a PCCP.
For an AI diagnostic algorithm that will be retrained post-market, you need at minimum four requirement layers:
1. Clinical Performance Requirements. These are the floor. They define the minimum acceptable sensitivity, specificity, PPV, NPV, and AUC for the algorithm’s intended use population. They must be stable: retraining should never push the algorithm below these thresholds. These requirements anchor the PCCP boundary conditions and must be traceable to your clinical risk analysis.
2. Algorithm Architecture Constraints. These define which structural properties of the algorithm are fixed and which are variable. A requirement in this layer might specify that the model architecture class (e.g., convolutional neural network with no more than N layers) is frozen for the PCCP period, while hyperparameters are free within stated ranges. Architecture constraints connect clinical performance requirements to the technical retraining protocols.
3. Training Data Governance Requirements. These are often omitted from traditional requirements documentation because there’s no analog in hardware engineering. For a retrained algorithm, you need requirements that specify the minimum size, demographic composition, annotator qualification criteria, and labeling protocol for any dataset used in a retraining cycle. The Good Machine Learning Practice (GMLP) guiding principles, which the FDA published with Health Canada and the MHRA in 2021, list data management and data quality assurance among the engineering fundamentals, which means training data management needs documented requirements just like any other controlled process.
4. Change Control Performance Requirements. These are the PCCP-specific layer. They define the measurable criteria that a retrained model must meet before it can be deployed as a cleared modification. They typically include statistical comparisons against the locked baseline model on the reference validation set, with pre-specified equivalence margins.
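Layers 2 and 3 lend themselves to automated checks that run before a retraining cycle is even evaluated for performance. The sketch below assumes a hypothetical dictionary layout for the constraints, the candidate model description, and the dataset records; none of these field names come from FDA material.

```python
from collections import Counter


def check_architecture(candidate: dict, constraints: dict) -> list[str]:
    """Compare a retrained model's structure to the frozen/variable split."""
    problems = []
    if candidate["architecture_class"] != constraints["architecture_class"]:
        problems.append("architecture class is frozen and was changed")
    if candidate["num_layers"] > constraints["max_layers"]:
        problems.append("layer count exceeds the allowed maximum")
    for name, value in candidate["hyperparameters"].items():
        bounds = constraints["hyperparameter_ranges"].get(name)
        if bounds is None:
            problems.append(f"{name} is not declared as a variable hyperparameter")
        elif not bounds[0] <= value <= bounds[1]:
            problems.append(f"{name}={value} is outside {bounds}")
    return problems


def check_dataset(records: list[dict], rules: dict) -> list[str]:
    """Check a retraining dataset against the data governance requirements."""
    problems = []
    if len(records) < rules["min_size"]:
        problems.append("dataset is smaller than the required minimum")
    counts = Counter(r["subgroup"] for r in records)
    for subgroup, min_fraction in rules["min_subgroup_fraction"].items():
        fraction = counts[subgroup] / len(records) if records else 0.0
        if fraction < min_fraction:
            problems.append(f"subgroup {subgroup} is under-represented")
    if any(r["annotator"] not in rules["qualified_annotators"] for r in records):
        problems.append("at least one label comes from an unqualified annotator")
    if any(r["labeling_protocol"] != rules["labeling_protocol"] for r in records):
        problems.append("at least one label used a different labeling protocol")
    return problems
```

Both functions return a list of findings rather than a single pass/fail flag, so the output can be filed directly as a documentation artifact for the retraining cycle.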
The challenge is that these four layers need to be connected. A change in the training data governance requirements (say, adding a new patient demographic) must trigger a review of whether the clinical performance requirements remain achievable, which in turn may affect the architecture constraints. Without traceable connections between these layers, your PCCP documentation will have gaps that FDA reviewers will find.
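One way to make those connections operational is to record, for each requirement, which other requirements must be reviewed when it changes, and then walk that map. The following is a minimal sketch; the requirement IDs (DG, CP, AC, CC prefixes) and the trigger map are hypothetical, and the edges mean "triggers a review of", not "is derived from".

```python
from collections import deque

# Hypothetical IDs: DG = data governance, CP = clinical performance,
# AC = architecture constraint, CC = change control.
REVIEW_TRIGGERS = {
    "DG-003": ["CP-001"],            # new demographic -> re-check performance floor
    "CP-001": ["AC-002", "CC-001"],  # floor in question -> re-check architecture, margins
    "AC-002": ["CC-001"],
}


def reviews_triggered_by(changed_id: str, triggers: dict = REVIEW_TRIGGERS) -> list[str]:
    """Breadth-first walk listing every requirement that needs review, nearest first."""
    seen = {changed_id}
    order = []
    queue = deque([changed_id])
    while queue:
        current = queue.popleft()
        for dependent in triggers.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order
```

Calling `reviews_triggered_by("DG-003")` walks from the data governance change through the clinical performance requirement to the architecture and change control layers, which mirrors the review chain described above.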
How V&V Planning Must Change for Adaptive Systems
Verification and validation for a fixed medical device follows a predictable structure: requirements are locked, test cases are written to those requirements, the device is tested, and the outputs are documented. For an adaptive AI system under a PCCP, this structure needs to be extended in three ways.
Baseline Model V&V. This is closest to the traditional structure. The initial algorithm submission requires the same level of V&V documentation as any Class II software: software hazard analysis, software requirements specification, software design document, and a complete software verification and validation plan and report. The FDA’s guidance on software functions in medical devices (the 2023 final guidance superseding the 2005 guidance) applies in full.
Change Protocol V&V. This is new territory for most teams. Every modification type described in the PCCP needs a corresponding test protocol that will be executed prior to deploying the modified algorithm. These protocols must be pre-specified: the FDA does not want you designing the test after you’ve seen the retrained model’s results. The pre-specified protocol should include: the reference dataset to be used, the statistical test for equivalence or non-inferiority, the decision threshold that determines pass/fail, and the documentation artifacts to be generated. This is a prospective V&V protocol, written before any specific retraining cycle begins, that governs all future retraining cycles of that modification type.
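As an illustration of what "pre-specified" can mean in practice, the sketch below fixes the reference dataset, margin, and significance level in a protocol object before any results exist, then applies a paired non-inferiority check on sensitivity. The choice of test (a Wald interval on the paired difference in proportions), the 0.025 margin, and all names are assumptions for the example; a real protocol would justify its own statistical method and margin.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist


@dataclass(frozen=True)
class ChangeProtocol:
    protocol_id: str
    reference_dataset_id: str
    margin: float  # largest acceptable drop in sensitivity
    alpha: float   # one-sided significance level


def noninferiority_sensitivity(baseline_correct, candidate_correct, protocol):
    """Paired comparison on the disease-positive cases of the locked reference set.

    Each input holds one boolean per positive case: did the model detect it?
    """
    n = len(baseline_correct)
    if n == 0 or n != len(candidate_correct):
        raise ValueError("both models must be scored on the same non-empty case list")
    pairs = list(zip(baseline_correct, candidate_correct))
    lost = sum(1 for b, c in pairs if b and not c)    # detected only by baseline
    gained = sum(1 for b, c in pairs if c and not b)  # detected only by candidate
    difference = (gained - lost) / n
    std_error = sqrt((lost + gained) - (gained - lost) ** 2 / n) / n
    z = NormalDist().inv_cdf(1 - protocol.alpha)
    lower_bound = difference - z * std_error
    return {
        "protocol_id": protocol.protocol_id,
        "reference_dataset_id": protocol.reference_dataset_id,
        "difference": difference,
        "lower_bound": lower_bound,
        "passed": lower_bound > -protocol.margin,
    }
```

The returned dictionary doubles as the documentation artifact for the cycle: it records which protocol and dataset were used alongside the decision, so the pass/fail result can be traced back to the criteria that were fixed in advance.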
Continuous Monitoring V&V. Once a modified algorithm is deployed, the FDA expects monitoring data to confirm that real-world performance remains within the PCCP boundaries. This isn’t passive surveillance; it’s an active requirement to collect structured performance data, compare it to the pre-deployment reference, and trigger a formal review if performance degrades beyond a pre-specified threshold. This monitoring protocol needs to be designed as part of the PCCP, not retrofitted after deployment.
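A monitoring rule of this kind can be stated precisely enough to execute. The sketch below is one hypothetical form: it tracks a rolling window of confirmed-positive cases and reports when observed sensitivity falls further below the pre-deployment reference than the pre-specified threshold allows. The class name, window logic, and status strings are assumptions for illustration.

```python
from collections import deque


class SensitivityMonitor:
    """Rolling check of real-world sensitivity against the pre-deployment reference."""

    def __init__(self, reference: float, max_drop: float, window_size: int, min_cases: int):
        self.reference = reference  # sensitivity measured before deployment
        self.max_drop = max_drop    # pre-specified degradation threshold
        self.min_cases = min_cases  # do not judge on too few confirmed cases
        self.window = deque(maxlen=window_size)

    def record(self, detected: bool) -> None:
        """Log one confirmed-positive case and whether the algorithm flagged it."""
        self.window.append(bool(detected))

    def status(self) -> str:
        if len(self.window) < self.min_cases:
            return "insufficient_data"
        observed = sum(self.window) / len(self.window)
        if self.reference - observed > self.max_drop:
            return "review_required"
        return "within_bounds"
```

The explicit "insufficient_data" state matters: without a minimum case count, a handful of early misses would trigger reviews that the data cannot support.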
The engineering implication is that your V&V planning effort doesn’t end at initial submission. It expands, because each retraining cycle must be documented against the pre-specified change protocol, and monitoring data must be reviewed on a defined schedule. Teams that don’t build this into their quality system infrastructure before submission will find themselves creating it under time pressure when the first post-market retraining cycle arrives.
What FDA Submissions for Class II AI/ML SaMD Actually Need
FDA reviewers working on AI/ML SaMD submissions have become substantially more specific in their expectations over the past 24 months. Teams preparing their first AI diagnostic submission often underestimate the documentation requirements, not for the algorithm itself but for the engineering process that produced it and will continue to govern it.
Based on publicly available Additional Information request letters and FDA workshop materials, the most common engineering documentation gaps in Class II AI/ML SaMD submissions are:
Insufficient traceability between clinical claims and algorithm design inputs. A device that claims to detect early-stage diabetic retinopathy must trace that clinical claim through to specific labeling criteria, the training data composition, and the architecture decisions that enable detection at the claimed performance level. When this chain is broken or implicit, reviewers ask for it explicitly — and reconstructing it post-development is difficult.
PCCP modification descriptions that are too vague to be verifiable. Phrases like “the model may be retrained on additional data to improve performance” are not acceptable PCCP language. The FDA needs to know: what constitutes additional data (governed by which data governance requirements), what performance improvement means (measured by which metric against which benchmark), and what would trigger a stop decision.
Missing reference to GMLP principles in software development documentation. The 2021 GMLP guiding principles have become de facto expectations. Submissions that don’t address data management practices, model transparency, human-AI team performance, and algorithm bias assessment are being flagged.
Inadequate description of the human-AI interface in the context of the PCCP. If the algorithm changes, the labeling and operator guidance may also need to change. FDA submissions need to describe how algorithm updates will be communicated to users and whether updated training or labeling is required.
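The vagueness gap in particular can be caught before submission by forcing each modification description into a structure whose empty fields are visible. The sketch below is a hypothetical record layout based on the questions listed above; the field names are assumptions, not FDA terminology.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ModificationDescription:
    name: str
    data_governance_requirement_ids: list = field(default_factory=list)
    metric: str = ""             # which metric defines "improvement"
    benchmark: str = ""          # measured against which reference
    success_criterion: str = ""  # what result counts as acceptable
    stop_trigger: str = ""       # what result halts deployment
    user_communication: str = "" # how the update reaches users


def unspecified_fields(description: ModificationDescription) -> list[str]:
    """Names of fields left empty, in declaration order."""
    return [f.name for f in fields(description) if not getattr(description, f.name)]
```

A description that consists only of a name such as "retrain on additional data" comes back with every other field listed as unspecified, which is a reasonable proxy for the objection a reviewer would raise.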
The Tooling Problem
The engineering documentation requirements for AI/ML SaMD under a PCCP exceed what most medical device teams’ current tooling was designed to support.
Traditional requirements management tools — IBM DOORS, Polarion, Jama Connect — were built for waterfall, fixed-specification development. They handle traceability between requirements layers well, but they weren’t designed to manage requirements that describe behavioral envelopes, or to trace algorithm design decisions back through training data governance requirements to clinical risk analysis. The disconnect becomes obvious when you try to write a change impact analysis for a retrained model: you need to trace from the performance delta in the new model back through every requirement it implicates, and the tooling to do that needs to understand the semantic relationship between those layers, not just their link structure.
This is where graph-based requirements infrastructure has a practical advantage over document-based systems. In a graph-based model, requirement nodes carry attributes and relationships that can be queried — which requirements are affected by a change in the training data governance layer, which test cases are linked to performance boundary requirements, which architecture constraints are upstream of a clinical claim. For PCCP-based development, being able to answer those queries quickly isn’t a nice-to-have; it’s what makes change impact analysis tractable.
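To make the idea of queryable requirement nodes concrete, here is a minimal in-memory sketch. It is not how any particular tool is implemented; the class, the relation names, and the requirement IDs are all hypothetical, and the edges here are derivation and verification links that point from upstream to downstream.

```python
class RequirementGraph:
    """Requirement nodes with attributes and typed, directed links."""

    def __init__(self):
        self.attributes = {}
        self.outgoing = {}  # node -> [(relation, target)]
        self.incoming = {}  # node -> [(relation, source)]

    def add_node(self, node_id, **attributes):
        self.attributes[node_id] = attributes

    def add_edge(self, source, relation, target):
        self.outgoing.setdefault(source, []).append((relation, target))
        self.incoming.setdefault(target, []).append((relation, source))

    def _reach(self, start, adjacency):
        seen, stack = set(), [start]
        while stack:
            for _, neighbour in adjacency.get(stack.pop(), []):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        return seen

    def upstream(self, node_id):
        """Everything node_id depends on, directly or transitively."""
        return self._reach(node_id, self.incoming)

    def downstream(self, node_id):
        """Everything affected if node_id changes."""
        return self._reach(node_id, self.outgoing)

    def where(self, node_ids, **attrs):
        """Filter node IDs by attribute values."""
        return {n for n in node_ids
                if all(self.attributes.get(n, {}).get(k) == v for k, v in attrs.items())}

    def sources(self, target, relation):
        """Nodes linked into target by a given relation, e.g. test cases that verify it."""
        return {s for r, s in self.incoming.get(target, []) if r == relation}


g = RequirementGraph()
g.add_node("DG-003", layer="data_governance")
g.add_node("AC-002", layer="architecture")
g.add_node("CP-001", layer="clinical_performance")
g.add_node("CC-001", layer="change_control")
g.add_node("TC-010", layer="test_case")
g.add_edge("DG-003", "informs", "AC-002")
g.add_edge("AC-002", "supports", "CP-001")
g.add_edge("CP-001", "bounds", "CC-001")
g.add_edge("TC-010", "verifies", "CC-001")
```

With this toy graph, `g.downstream("DG-003")` answers the data governance question, `g.where(g.upstream("CP-001"), layer="architecture")` finds the architecture constraints upstream of the clinical requirement, and `g.sources("CC-001", "verifies")` lists the test cases tied to the performance boundary.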
Flow Engineering, which is built specifically for hardware and systems engineering workflows, has been used by medical device teams to structure exactly this kind of multi-layer, graph-connected requirements model for AI diagnostic programs. Its approach — treating requirements as nodes in a connected model rather than rows in a document — fits the PCCP’s structural demands better than tools that generate requirements matrices from Word-style documents. It’s not a regulatory compliance tool, but it provides the traceability infrastructure that makes FDA-required documentation producible rather than reconstructable.
The practical question for engineering teams isn’t which tool to use; it’s whether their current tooling can support a change impact query that crosses from a clinical performance requirement through algorithm architecture constraints into training data governance — and produce a defensible audit trail for that query. If it can’t, the PCCP documentation will require manual effort that introduces both error and delay.
What Teams Should Actually Do Now
The FDA’s AI/ML framework is still evolving: the draft guidance on Artificial Intelligence-Enabled Device Software Functions published in 2024 is not yet final. But the direction is clear and the documentation expectations are already affecting submissions.
Three things engineering teams can do now, regardless of submission timeline:
Audit your requirements structure against the four-layer model. If your AI diagnostic program’s requirements documentation doesn’t distinguish between clinical performance requirements, architecture constraints, training data governance requirements, and change control performance requirements, you have a gap that will surface in submission review.
Write your change protocol before your first retraining cycle. Even if a PCCP isn’t required for your specific submission path, pre-specifying how you will evaluate a retrained model forces you to surface ambiguities in your baseline V&V structure that would otherwise create problems post-market.
Assess whether your traceability tooling can support bidirectional change impact analysis across your requirement layers. If it can’t, the documentation burden for a PCCP-governed program will land on individuals rather than on systems — which is both a quality risk and a scaling problem as the algorithm evolves.
The FDA isn’t asking medical device engineering teams to become machine learning researchers. It’s asking them to apply rigorous engineering discipline to a class of systems that changes post-deployment — and to document that discipline in a way that supports ongoing review. The teams that will navigate this most efficiently are the ones that treat it as an engineering infrastructure problem before it becomes a submission problem.