Managing “Model Drift”: Post-Market Monitoring Requirements for 2026

Anna Lisowska

⚡ TL;DR

  • Article 72 of the EU AI Act requires providers of high-risk AI systems to have a post-market monitoring plan in place from the day the system goes live — not as a future governance aspiration.
  • Model drift — the degradation of a model’s real-world performance relative to its validated baseline — is the primary technical phenomenon that post-market monitoring is designed to detect. Left undetected, it creates simultaneous compliance failure and harm to affected individuals.
  • A compliant post-market monitoring plan has three non-negotiable elements: proactive performance tracking with pre-documented thresholds, systematic incident capture with regulatory reporting workflows, and corrective action governance linking observations to documented responses.

AI systems do not stay the way you validated them. This is not a failure of engineering — it is an inherent property of machine learning systems deployed in dynamic real-world environments. The population that interacts with the system shifts. The upstream data sources that feed it change. The social, economic, or behavioural patterns the model was trained to recognise evolve. And over time, the gap between the world the model learned and the world the model operates in grows — a phenomenon the ML community calls model drift, and one that the EU AI Act’s post-market monitoring framework is specifically designed to catch.

Article 72 makes post-market monitoring a legal obligation rather than engineering best practice. A high-risk AI system whose provider cannot demonstrate active performance monitoring — with documented thresholds, incident records, and corrective action history — is not compliant, regardless of how well it performed at the time of its conformity assessment. This post gives you the complete framework: the types of drift you must monitor, the monitoring architecture that produces auditable evidence, and the incident and reporting workflows that Article 72 and Article 73 require.

For the broader audit-readiness context, see our pillar guide: Audit-Ready AI: The Step-by-Step Guide to Passing a Conformity Assessment. For the Technical File documentation that your monitoring programme feeds, see our Article 11 & Annex IV guide.

What the EU AI Act Requires Under Article 72

Article 72(1) requires providers of high-risk AI systems to establish a post-market monitoring system — a structured process that “actively and systematically collects, documents and analyses relevant data provided by deployers and, where applicable, users, in order to evaluate whether the system continues to comply with the requirements set out in Chapter III Section 2 over its lifetime.”

Three obligations compound the baseline monitoring requirement:

  • The plan must predate the system’s launch. Article 72(1) requires the post-market monitoring system to be established as part of the provider’s quality management system (Article 17). This means the monitoring architecture — dashboards, alert thresholds, incident capture workflow — must be built and documented before the first deployer goes live. A monitoring plan written after an incident is not a monitoring plan; it is an incident report.
  • Serious incidents must be reported externally. Article 73 requires providers to report any serious incident (an incident that directly or indirectly leads to death or serious harm to health, serious and irreversible disruption of critical infrastructure, infringement of fundamental-rights obligations, or serious harm to property or the environment) to the market surveillance authority of the member state where the incident occurred. The general deadline is 15 days from becoming aware of the incident, shortened to 10 days where the incident involves the death of a person and to 2 days for a widespread infringement or a serious disruption of critical infrastructure.
  • Corrective actions must be documented and linked. Where monitoring reveals that the system is no longer compliant with Chapter III requirements, the provider must immediately implement corrective actions, update the Technical File, and notify deployers and, where applicable, the EU database.

Understanding Model Drift: The Three Types Your Monitoring Must Cover

Model drift is not a single phenomenon — it is a family of related degradation patterns that manifest differently and require different detection approaches. A post-market monitoring plan that covers only one type while missing the others creates systematic blind spots.

Type 1: Data Drift (Covariate Shift)

Data drift occurs when the statistical distribution of the model’s input features shifts away from the distribution in the training data — without any change in the underlying relationship between inputs and outputs. The model’s learned function is still valid; it is simply being applied to a population it was not calibrated for.

Example: A credit scoring model trained on pre-pandemic financial behaviour data is deployed into a post-pandemic lending environment where applicant income profiles, employment stability patterns, and spending behaviours have shifted substantially. The model’s learned associations may no longer reflect the current relationship between features and creditworthiness.

Detection approach: Monitor the distributional statistics (mean, variance, percentile distributions) of key input features in production against the same statistics in the training dataset. Tools like Evidently AI and whylogs provide open-source data drift detection with automated statistical tests (Population Stability Index, Kolmogorov-Smirnov test, Jensen-Shannon divergence) that flag significant distributional shifts before they cause performance degradation.
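
As an illustration of this kind of check, the sketch below computes a Population Stability Index for one numeric feature against stored training values. The function name, bin count, and the conventional rule-of-thumb reading (PSI above roughly 0.2 signals meaningful drift) are assumptions for this example, not part of any tool or regulation:

```python
import numpy as np

def population_stability_index(expected, actual, bins=10):
    """PSI between a reference (training) sample and a production
    sample of one numeric feature. Values above ~0.2 are commonly
    read as significant drift (a convention, not a legal threshold)."""
    # Bin edges come from the reference distribution's quantiles
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    e_counts, _ = np.histogram(expected, bins=edges)
    # Clip production values into the reference range so out-of-range
    # observations land in the outer bins instead of being dropped
    a_counts, _ = np.histogram(np.clip(actual, edges[0], edges[-1]), bins=edges)
    e_pct = np.clip(e_counts / len(expected), 1e-6, None)
    a_pct = np.clip(a_counts / len(actual), 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)       # reference feature values
prod_stable = rng.normal(0.0, 1.0, 10_000) # same distribution: no drift
prod_shifted = rng.normal(1.0, 1.0, 10_000)  # mean shifted by one sigma

psi_stable = population_stability_index(train, prod_stable)
psi_drift = population_stability_index(train, prod_shifted)
```

In practice you would store the reference quantile edges in the baseline registry at validation time rather than recomputing them from raw training data.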

Type 2: Concept Drift

Concept drift is more insidious than data drift: it occurs when the underlying relationship between inputs and outputs changes — meaning the model’s learned function itself becomes incorrect, not just its input distribution. The patterns the model was trained to identify no longer hold in the current world.

Example: An employee performance assessment model trained on historical performance data in a physical office environment is applied to remote and hybrid workers. The relationship between measurable proxy features (office attendance, in-person meeting participation, certain communication patterns) and actual performance has fundamentally changed — the model’s learned associations are now systematically wrong for a significant portion of its deployment population.

Detection approach: Concept drift requires ground truth data — you need to know actual outcomes to compare against model predictions. Where ground truth is available with reasonable latency (loan defaults within 90 days, hiring decisions with 6-month performance review follow-up), monitor prediction error rates over rolling windows. Where ground truth is delayed or unavailable, monitor for proxy signals: increasing override rates by human operators, increasing complaint rates from affected individuals, and increasing divergence between model confidence and human reviewers’ agreement rates.
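
Where ground truth does arrive, the rolling-window error tracking described above can be implemented minimally as below. The class name and every threshold value are hypothetical; real values belong in your documented baseline:

```python
from collections import deque

class RollingErrorMonitor:
    """Track prediction error over a rolling window and compare it to
    a pre-documented baseline (all numbers here are illustrative)."""

    def __init__(self, window=500, baseline_error=0.08, critical_ratio=1.3):
        self.outcomes = deque(maxlen=window)
        self.critical = baseline_error * critical_ratio  # e.g. 0.104

    def record(self, prediction, ground_truth):
        # Store 1 for a miss, 0 for a hit
        self.outcomes.append(int(prediction != ground_truth))

    @property
    def error_rate(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def breached(self):
        # Alert only on a full window, to avoid start-up noise
        return (len(self.outcomes) == self.outcomes.maxlen
                and self.error_rate > self.critical)
```

The same structure works for the proxy signals mentioned above: feed it operator overrides instead of ground-truth mismatches and the breach check becomes an override-rate alarm.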

Type 3: Demographic Drift

The most legally consequential form of drift for EU AI Act compliance: a shift in the model’s performance differential across demographic groups, even when aggregate performance appears stable. A system whose overall accuracy is unchanged but whose false-positive rate for a protected demographic group has increased threefold is experiencing demographic drift — and has acquired a discrimination risk that did not exist at the time of its conformity assessment.

This drift pattern is particularly likely when: the demographic composition of the production population shifts away from the training set; upstream data sources change in ways that introduce new proxies for protected characteristics; or the societal context changes in ways that affect how AI outputs interact with specific demographic groups.

Detection approach: Maintain disaggregated performance monitoring — the same fairness metrics you documented in your Technical File (disparate impact ratio, equalised odds differential, predictive parity gap) monitored continuously in production against the documented baselines. This is not optional: Article 72’s requirement to monitor whether the system “continues to comply” with Chapter III requirements includes the Article 10 data governance and bias requirements — meaning demographic drift that increases disparate impact is a compliance failure requiring corrective action.
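
A sketch of disaggregated monitoring for one such metric, the disparate impact ratio. The group names and window data are invented, and the 0.8 cut-off is the familiar four-fifths convention rather than an EU AI Act threshold; in practice the baselines documented in your Technical File govern:

```python
def disparate_impact_ratio(outcomes_by_group, reference_group):
    """Favourable-outcome rate of each group divided by the reference
    group's rate. Ratios below ~0.8 are commonly flagged under the
    'four-fifths' convention (illustrative, not a statutory test)."""
    rates = {g: sum(o) / len(o) for g, o in outcomes_by_group.items()}
    ref = rates[reference_group]
    return {g: r / ref for g, r in rates.items()}

# Hypothetical production window: 1 = favourable outcome
window = {
    "group_a": [1] * 80 + [0] * 20,   # 80% favourable
    "group_b": [1] * 55 + [0] * 45,   # 55% favourable
}
ratios = disparate_impact_ratio(window, reference_group="group_a")
flagged = {g for g, r in ratios.items() if r < 0.8}  # groups breaching the cut-off
```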

The Post-Market Monitoring Architecture

A compliant monitoring architecture has four integrated components that work together to detect drift, generate evidence, and trigger corrective action.

Component 1: The Performance Baseline Registry

Before deployment, document the performance baseline that post-market monitoring will be measured against. This is the single most important document in your monitoring plan — because without a documented baseline, you have no way to demonstrate that observed performance represents a change rather than the expected performance level.

The baseline registry should capture, for each high-risk AI system: all performance metrics documented in the Technical File (Annex IV §4) at their validated levels; disaggregated metrics by every demographic group your bias testing covered; input feature distribution statistics that serve as the reference for data drift detection; and the alert thresholds for each metric — the level at which drift triggers escalation. Thresholds must be set and documented before deployment, not after drift is observed. See our post on building an immutable audit trail for how to store baseline records with tamper-evidence.
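
One way to sketch such a registry is a small typed record per metric plus a content hash over the whole registry, which gives a simple tamper-evidence anchor for the stored baseline. All field names and values below are illustrative:

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class BaselineRecord:
    """One baseline registry entry (hypothetical schema)."""
    system_id: str
    metric: str
    validated_value: float
    warning_threshold: float    # triggers internal review
    critical_threshold: float   # triggers the corrective action workflow
    demographic_group: str = "all"

def registry_digest(records):
    """Deterministic content hash of the registry; storing this digest
    alongside audit records makes later edits to the baseline visible."""
    payload = json.dumps(sorted(map(asdict, records), key=str), sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

baseline = [
    BaselineRecord("credit-v2", "auc", 0.86, 0.82, 0.77),
    BaselineRecord("credit-v2", "false_positive_rate", 0.07, 0.09, 0.11, "group_b"),
]
digest = registry_digest(baseline)
```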

Component 2: Continuous Monitoring Dashboards

The monitoring infrastructure must provide continuous visibility into production performance, not just periodic snapshots. In practice, “continuous” means monitoring at a frequency appropriate to the system’s decision volume and risk level — daily for high-volume systems making consequential decisions, weekly for lower-volume systems.

Recommended open-source monitoring stack:

  • Evidently AI — comprehensive ML monitoring including data drift, model performance, and data quality reports. Generates HTML monitoring reports that can be stored as compliance records.
  • NannyML — specialised in performance estimation without ground truth labels, using confidence-based estimation to detect concept drift before ground truth data is available. Particularly useful for systems with delayed feedback loops.
  • Grafana — dashboard infrastructure for visualising monitoring metrics with configurable alerting on threshold breaches. Connects to most monitoring data sources and provides the operator-facing view of system health.

Configure alert thresholds at two levels: a warning level (10–15 % degradation from baseline, typically) that triggers internal review without immediate corrective action; and a critical level (25–30 % degradation, or any breach of the documented compliance boundary) that triggers the corrective action workflow and potentially the Article 73 incident assessment.
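
The two-level scheme reduces to a small classification helper. The 10% and 25% bands below are the illustrative figures from the paragraph above and should be replaced by the thresholds documented in your own plan:

```python
def classify_degradation(baseline, observed, higher_is_better=True):
    """Classify an observed production metric against its documented
    baseline using relative-degradation bands (10% / 25% here are
    illustrative, not regulatory values)."""
    if higher_is_better:
        degradation = (baseline - observed) / baseline
    else:
        degradation = (observed - baseline) / baseline
    if degradation >= 0.25:
        return "critical"   # corrective action workflow + Article 73 assessment
    if degradation >= 0.10:
        return "warning"    # internal review, no immediate corrective action
    return "ok"
```

Metrics where lower is better, such as a false-positive rate, use `higher_is_better=False` so the bands measure increases rather than decreases.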

Component 3: Incident Capture and Classification

Article 73 serious incident reporting requires a defined workflow for capturing, assessing, and reporting incidents. This workflow must be operational before deployment — an incident that occurs before the workflow is in place cannot be reported within the 15-day window if the first 10 days are spent building the reporting process.

| Incident category | Article 73 classification | Reporting deadline | Report recipient |
| --- | --- | --- | --- |
| Death of a person directly or indirectly attributable to the AI system | Serious incident (death) | 10 days from awareness | National market surveillance authority |
| Serious harm to health, property, or the environment attributable to the AI system | Serious incident | 15 days from awareness | National market surveillance authority |
| Serious and irreversible disruption of critical infrastructure | Serious incident (critical infrastructure) | 2 days from awareness | National market surveillance authority, plus any relevant sector regulator |
| Performance degradation causing Technical File non-compliance | Not a serious incident | Not externally reportable | Internal corrective action and deployer notification |
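
Once an incident is classified, computing the reporting deadline should be mechanical. In the sketch below the category names are invented; the day counts reflect Article 73's deadlines of 10 days for a death, 2 days for critical-infrastructure disruption, and 15 days for other serious incidents:

```python
from datetime import date, timedelta

# Days from awareness to the reporting deadline; None marks categories
# handled internally rather than reported externally
REPORTING_DEADLINE_DAYS = {
    "death": 10,
    "critical_infrastructure_disruption": 2,
    "other_serious_incident": 15,
    "compliance_breach": None,
}

def report_due_date(category, awareness_date):
    """Deadline for the market surveillance authority report, or None
    where only internal corrective action and deployer notice apply."""
    days = REPORTING_DEADLINE_DAYS[category]
    if days is None:
        return None
    return awareness_date + timedelta(days=days)
```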

Component 4: Corrective Action Governance

The corrective action governance process closes the loop from monitoring signal to documented response. Every alert that triggers a review must produce one of three documented outcomes: (1) root cause identified — corrective action initiated, timetabled, and tracked to completion; (2) investigation inconclusive — investigation extended with documented reasoning; or (3) false alarm — documented explanation of why the metric variation does not represent a compliance issue.

Critically, corrective actions must be linked back to the Technical File. Where a corrective action changes the system’s performance characteristics, architecture, or operational parameters, the Technical File must be updated to reflect the current system state. Undocumented corrective actions that change the system’s documented behaviour create a divergence between the Technical File and the actual system — which is itself a compliance failure.
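
The three-outcome rule can be enforced at the data-model level, so that no alert resolution can be recorded without one of the documented outcomes, and no corrective action can be recorded without a Technical File reference. The schema below is a hypothetical sketch, not any particular system's API:

```python
from dataclasses import dataclass
from typing import Optional

VALID_OUTCOMES = {"corrective_action", "investigation_extended", "false_alarm"}

@dataclass
class AlertResolution:
    """One documented outcome for a monitoring alert (illustrative schema)."""
    alert_id: str
    outcome: str
    rationale: str
    technical_file_update: Optional[str] = None  # reference to the updated section

    def __post_init__(self):
        if self.outcome not in VALID_OUTCOMES:
            raise ValueError(f"outcome must be one of {VALID_OUTCOMES}")
        # A corrective action that changes documented behaviour must point
        # at the Technical File revision that reflects the new state
        if self.outcome == "corrective_action" and not self.technical_file_update:
            raise ValueError("corrective action requires a Technical File reference")
```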

Practical Challenges: Monitoring Without Ground Truth

The most significant practical challenge in post-market monitoring for many high-risk AI systems is the ground truth problem. To measure whether a model’s predictions are correct, you need to know the actual outcomes — but for many consequential decisions, those outcomes are delayed, partially observable, or systematically biased by the model’s own decisions (if the model approves fewer loans, you cannot observe the default rates of the loans it rejected).

Practical approaches to the ground truth challenge:

  • Delayed ground truth collection: For systems with observable outcomes at defined lags (hiring decisions with 6-month performance reviews, credit decisions with 90-day default observation), build systematic ground truth collection into your operational process and feed it back to your monitoring pipeline.
  • Confidence-based performance estimation: Tools like NannyML use the model’s own confidence distribution — without requiring outcome labels — to estimate performance changes. A model that was 90 % confident on 80 % of cases and is now 90 % confident on 50 % of cases has likely experienced concept drift even without confirmed outcome data.
  • Override rate monitoring: Human override rates are a proxy ground truth signal. Sustained increases in override rates — operators rejecting AI outputs more frequently — indicate that the model’s recommendations are diverging from operator judgment, which is itself an Article 14 compliance signal as well as a drift indicator.
  • Complaint and appeal tracking: For systems that produce decisions affecting individuals, systematic tracking of complaints and formal appeals provides external signal about performance degradation, particularly in demographic groups that are unlikely to appear in internal monitoring data.
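
The confidence-distribution signal in the second bullet can be approximated without any specialised library: compare the share of high-confidence predictions in a recent production window against the validation set. The scores and the 0.15 trigger below are invented for illustration:

```python
def high_confidence_share(confidences, threshold=0.9):
    """Fraction of predictions with confidence at or above the threshold."""
    return sum(c >= threshold for c in confidences) / len(confidences)

# Hypothetical confidence scores: validation set vs. a recent window
validation_conf = [0.95] * 80 + [0.60] * 20   # 80% high-confidence at validation
recent_conf     = [0.95] * 50 + [0.60] * 50   # 50% high-confidence now

shift = high_confidence_share(validation_conf) - high_confidence_share(recent_conf)
review_needed = shift > 0.15   # illustrative internal-review trigger
```

Tools like NannyML implement far more careful versions of this idea, but even this crude check catches the "was 90% confident on 80% of cases, now only on 50%" pattern described above.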

Frequently Asked Questions

What must an EU AI Act post-market monitoring plan contain?

Article 72 requires the post-market monitoring plan to be part of the Quality Management System established under Article 17, and to actively and systematically collect, document, and analyse data to evaluate whether the system continues to comply with Chapter III requirements throughout its lifetime. In practice, a compliant plan includes: the specific performance metrics to be monitored and their baselines; the monitoring frequency and data sources; alert thresholds that trigger internal review; the incident classification and reporting workflow including Article 73 deadlines; the corrective action governance process; and the mechanism for updating the Technical File when corrective actions change the system’s documented characteristics. The plan must be implemented from the system’s launch date, not developed after the first incident.

What is model drift and why does the EU AI Act address it?

Model drift is the degradation of a machine learning model’s real-world performance relative to its validated baseline, caused by changes in the data environment, the population the model serves, or the underlying relationships the model was trained to detect. The EU AI Act’s post-market monitoring requirement directly addresses drift because a model that passed its conformity assessment can degrade into non-compliance over time without any change to the system itself — purely due to environmental change. Article 72’s requirement for systematic monitoring ensures that providers detect and correct this degradation rather than continuing to operate a system that was compliant at launch but has drifted out of compliance in production.

How often must post-market monitoring be conducted?

The Act does not specify a minimum monitoring frequency — it requires monitoring to be “active and systematic,” which regulators interpret as proportionate to the system’s risk level and decision volume. High-volume systems making consequential decisions at scale (credit scoring processing thousands of applications daily, hiring AI screening hundreds of CVs weekly) should be monitored at daily or weekly intervals. Lower-volume systems in less time-sensitive contexts may be appropriately monitored monthly. Regardless of frequency, the monitoring must be continuous in the sense that alerts trigger real-time responses — not batched for a quarterly review that may not happen for months after a drift event begins.

What is a “serious incident” for Article 73 reporting purposes?

Article 3(49) defines a serious incident as an incident or malfunctioning of an AI system that directly or indirectly leads to any of the following: the death of a person or serious harm to a person’s health; a serious and irreversible disruption of the management or operation of critical infrastructure; an infringement of obligations under Union law intended to protect fundamental rights; or serious harm to property or the environment. The reporting deadline varies with the category: 10 days for a death, 2 days for critical-infrastructure disruption, and 15 days otherwise, all counted from awareness. If you are uncertain whether an incident meets the Article 73 threshold, the conservative approach is to report: regulators treat good-faith over-reporting more favourably than under-reporting of incidents that later prove serious.

Does deployer post-market monitoring satisfy the provider’s Article 72 obligation?

Partially. Article 72(2) specifically requires providers to solicit relevant data from deployers as part of their post-market monitoring. Deployer-collected operational data (usage patterns, override rates, performance observations, user complaints) is a required input to the provider’s monitoring programme. But the provider cannot outsource the monitoring obligation to deployers — the provider must maintain their own monitoring infrastructure, set and track compliance against the documented thresholds, manage the corrective action process, and report serious incidents to market surveillance authorities. Deployer data supplements provider monitoring; it does not replace it.

Need to build your Article 72 post-market monitoring programme?

Unorma integrates with your ML monitoring stack to generate compliant monitoring reports, manage alert thresholds against your Technical File baselines, and route serious incidents through the Article 73 reporting workflow automatically. Build Your Monitoring Programme →
