
As generative AI becomes more capable and widely deployed, familiar questions about the governance of other transformative technologies have resurfaced. Which opportunities, capabilities, risks, and impacts should be assessed? Who should carry out assessments, and at what stages of the technology life cycle? What tests or measurements should be used? And how do we know whether the results are reliable?
Recent research and reports from Microsoft, the UK AI Security Institute, The New York Times, and MIT Technology Review have highlighted gaps in how we evaluate AI models and systems. These gaps also form the backdrop for recent consensus reports from international experts: the first International AI Safety Report (2025) and the Singapore Consensus (2025). Closing these gaps at a pace that matches AI innovation will enable more reliable assessments that can help guide deployment decisions, inform policy, and build trust.
Today we are launching a limited-series podcast, AI Testing and Evaluation: Lessons from Science and Industry, to share insights from domains that have grappled with their own testing and measurement questions. Over four episodes, host Kathleen Sullivan speaks with academic experts in genome editing, cybersecurity, pharmaceuticals, and medical devices to learn which technical and regulatory practices have helped close assessment gaps and earn public trust.
Alongside the podcast, we are sharing written case studies from the experts, as well as high-level lessons we are applying to AI. At the end of the series, we will share Microsoft’s deeper reflections on next steps toward more reliable and trustworthy approaches to AI assessment.
Lessons from eight case studies
Our research into risk assessment, testing, and assurance models in other domains began in December 2024, when Microsoft’s Office of Responsible AI brought together independent experts from the fields of civil aviation, cybersecurity, financial services, genome editing, medical devices, nanoscience, nuclear energy, and pharmaceuticals. In convening this group, we drew on our own learnings and on feedback received on our eBook, Global Governance: Goals and Lessons for AI, in which we explored the higher-level objectives and institutional approaches that have been leveraged for cross-border governance in the past.
Although approaches to risk assessment and testing vary widely across the case studies, one high-level takeaway stands out: assessment frameworks always reflect trade-offs among policy goals such as safety, efficiency, and innovation.
Experts from all eight domains noted that policymakers have had to weigh trade-offs in designing assessment frameworks, accounting for both the limits of current science and the need for agility in the face of uncertainty. They also agreed that early design choices, which often reflect the “DNA” of the historical moment in which they were made, as cybersecurity expert Stewart Baker put it, matter greatly because they are difficult to reverse or undo later.
Strict pre-deployment testing regimes, such as those used in civil aviation, medical devices, nuclear power and pharmaceuticals, provide strong safety guarantees, but can be resource-intensive and slow to scale. These regimes often emerged in response to well-documented failures and rely on decades of regulatory infrastructure and detailed technical standards.
In contrast, areas marked by dynamic and complex interdependencies between the system under test and its external environment, such as cybersecurity and bank stress testing, rely on more adaptive governance frameworks, where testing can be used to generate actionable risk insights rather than primarily serving as a trigger for regulatory enforcement.
Additionally, in the pharmaceutical sector, where interdependencies are also at play and emphasis is placed on pre-deployment testing, experts highlighted a potential trade-off between post-market surveillance of downstream risks and evaluation of effectiveness.
These variations in approach across domains, stemming from differences in risk profiles, technology types, the maturity of evaluation science, where expertise sits within the assessor ecosystem, and the contexts in which technologies are deployed, among other factors, also hold lessons for AI.
Applying lessons from risk assessment and governance to AI
While no analogy maps perfectly onto AI, the cases of genome editing and nanoscience offer useful parallels for general-purpose technologies like AI, where risks vary significantly depending on how the technology is applied.
Experts highlighted the benefits of more flexible governance frameworks tailored to specific use cases and application contexts. In these areas, it is difficult to define risk thresholds and design assessment frameworks in the abstract. Risks become more visible and assessable once the technology is applied to a particular use case and the context-specific variables are known.
This and other insights also helped us distill the qualities essential for testing to serve as a reliable governance tool across domains, including:
- Rigor in defining what is being evaluated and why it matters. This requires detailed specification of what is measured and an understanding of how the deployment context can affect results.
- Standardization of how tests should be performed to obtain valid and reliable results. This requires technical standards that provide methodological guidance and ensure quality and consistency.
- Interpretability of test results and of how they inform risk decisions. This requires establishing expectations for evidence and improving understanding of how to read, contextualize, and use test results, while remaining aware of their limitations.
Towards a stronger foundation for AI testing
Establishing a solid foundation for AI evaluation and testing requires efforts to improve rigor, standardization, and interpretability, and to ensure that methods keep pace with rapid technological advances and evolving scientific understanding.
Learning from other general-purpose technologies, this foundational work must advance for both AI models and AI systems. While testing models will remain important, reliable evaluation tools that provide assurance of system performance will enable broader adoption of AI, including in higher-risk scenarios. A robust feedback loop between assessments of AI models and of AI systems could not only accelerate progress on methodological challenges, but also help clarify which opportunities, capabilities, risks, and impacts are most appropriate and effective to assess at which points in the AI development and deployment lifecycle.
Acknowledgments
We would like to thank the following external experts who contributed to our research program on lessons for AI testing and evaluation: Mateo Aboy, Paul Alp, Gerónimo Poletto Antonacci, Stewart Baker, Daniel Benamouzig, Pablo Cantero, Daniel Carpenter, Alta Charo, Jennifer Dionne, Andy Greenfield, Kathryn Judge, Ciaran Martin, and Timo Minssen.
Case studies
Civil aviation: Testing in aircraft design and manufacturing, by Paul Alp
Cybersecurity: Cybersecurity standards and testing: lessons for AI safety and security, by Stewart Baker
Financial services (bank stress testing): The evolution of the use of bank stress tests, by Kathryn Judge
Genome editing: Governance of genome editing in human therapeutic and agricultural applications, by Alta Charo and Andy Greenfield
Medical devices: Medical device testing: regulatory requirements, evolution and lessons for AI governance, by Mateo Aboy and Timo Minssen
Nanoscience: The regulatory landscape of nanoscience and nanotechnology and its applications to future AI regulation, by Jennifer Dionne
Nuclear energy: Testing in the nuclear industry, by Pablo Cantero and Gerónimo Poletto Antonacci
Pharmaceuticals: The history and evolution of testing in pharmaceutical regulation, by Daniel Benamouzig and Daniel Carpenter
