EMQN AI Discovery Sprint Case Study | OpenKit
Healthcare Diagnostics · 2025 · 4 weeks
EMQN CIC

EMQN AI Discovery Consultancy

Four-week discovery sprint designing an AI-assisted marking platform for genetic testing laboratories, achieving 93-96% per-criterion accuracy across six languages.

93-96%
Per-criterion accuracy
6
Models, 6 languages, 150 runs
7 providers
EU hosting options analysed
200+ hrs
Projected assessor time / year
Client
EMQN CIC · Healthcare Diagnostics
Engagement
Evidence-based discovery sprint
Timeline
4 weeks · 2025
Capabilities
AI Consulting · Strategy · Healthcare
01 The challenge

An exam board for genetic testing labs, with seventeen rule-based criteria assessors hand-check on every report.

EMQN is a Manchester-based Community Interest Company that functions as an exam board for laboratories. They run External Quality Assessment schemes for human genetic testing across the globe, processing roughly 20,000 laboratory reports a year across 50+ schemes in six languages. Expert volunteer assessors mark every report against standardised criteria, with at least two assessors per report and discordant results reconciled in moderation meetings.

The clerical accuracy component consists of 17 objective, rule-based criteria that assessors hand-check on every report. Assessors described it as "the most boring and laborious part" of their work, even though it is the least scientifically complex. It takes three to five minutes per report, multiplied across 20,000 reports, multiplied across volunteers donating their time.
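As a rough back-of-envelope illustration of that multiplication, the sketch below turns the approximate figures quoted above into hours per year. It assumes the three-to-five-minute estimate applies to each assessor separately and treats "at least two assessors" as exactly two, so it describes the gross clerical workload, not a savings projection.

```python
# Back-of-envelope workload estimate using the approximate figures quoted above.
REPORTS_PER_YEAR = 20_000
MINUTES_PER_REPORT = (3, 5)   # low and high estimate for the clerical checks
ASSESSORS_PER_REPORT = 2      # "at least two", so this is a floor (assumption)


def clerical_hours(minutes_per_report: float, assessors: int = 1) -> float:
    """Hours per year spent on clerical checks at the given per-report time."""
    return REPORTS_PER_YEAR * minutes_per_report * assessors / 60


low, high = (clerical_hours(m) for m in MINUTES_PER_REPORT)
print(f"One marking pass:  {low:,.0f} to {high:,.0f} hours/year")

low2, high2 = (clerical_hours(m, ASSESSORS_PER_REPORT) for m in MINUTES_PER_REPORT)
print(f"Two assessors:     {low2:,.0f} to {high2:,.0f} hours/year")
```

Under these assumptions a single marking pass costs roughly 1,000 to 1,667 hours a year, and double marking costs twice that.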

  • UKAS accreditation and ISO 27001 in force: data must stay inside UK / EU.
  • No US CLOUD Act exposure for any tier of the hosting stack.
  • Volunteer assessors keep final say; AI assists, never replaces.
  • Six supported languages must score consistently on the same criteria.
  • Third-party platform integration has no public API, so the spec carries fallback approaches.
03 What we built

Evidence-based discovery against EMQN's actual reports, in their actual languages, with the licensing constraints written in.

The sprint was structured to expose variance, not hide it. Six frontier AI models were tested across 150 scenarios using 15 representative reports in all six supported languages, with ten independent runs per model. Per-criterion accuracy hit 93-96% on 14 of the 17 criteria; strict accuracy across all seventeen dropped to 45%. That gap is exactly what validated the human-in-the-loop requirement: assist the assessor, do not automate the assessor away.
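The gap between the two metrics follows from how they are counted. The sketch below is illustrative only and is not the sprint's benchmark code: the function names and the toy data are hypothetical, and the last helper assumes errors are independent across criteria, which real reports will not strictly satisfy.

```python
from typing import Sequence

N_CRITERIA = 17  # number of clerical criteria, from the case study


def per_criterion_accuracy(results: Sequence[Sequence[bool]]) -> list[float]:
    """Fraction of reports marked correctly, computed separately per criterion."""
    n_reports = len(results)
    return [sum(column) / n_reports for column in zip(*results)]


def strict_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of reports on which every criterion was marked correctly."""
    return sum(all(row) for row in results) / len(results)


def strict_if_independent(p: float, n_criteria: int = N_CRITERIA) -> float:
    """Strict accuracy implied by per-criterion accuracy p, assuming independent errors."""
    return p ** n_criteria


# Toy data: 4 reports x 3 criteria, True = the model's mark matched the assessor's.
toy = [
    [True, True, True],
    [True, True, False],
    [True, False, True],
    [True, True, True],
]
print(per_criterion_accuracy(toy))  # [1.0, 0.75, 0.75]
print(strict_accuracy(toy))         # 0.5

for p in (0.93, 0.96):
    print(f"{p:.0%} per criterion -> {strict_if_independent(p):.0%} strict")
```

Under the independence assumption, 93-96% per criterion compounds to roughly 29-50% across 17 criteria, a range that brackets the reported 45%.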

One uncomfortable finding shaped the model recommendation: Meta's Llama 4 was marginally the most accurate model but could not be deployed by an EU-based organisation under current licensing terms. The recommendation went to the next-best model that an EU-resident UKAS-accredited service could actually run.

Seven cloud providers were analysed for CLOUD Act exposure, GDPR compliance, and operational cost. The traffic-light confidence UI was designed so an assessor can see at a glance which criteria the AI is confident on, which need a closer look, and which the system declined to mark.
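A minimal sketch of how the three traffic-light states could be derived from a model's output. The threshold, the colour assigned to declined marks, and the criterion names are all hypothetical; the case study does not state how the real design sets them.

```python
from enum import Enum
from typing import Optional


class Light(Enum):
    GREEN = "confident"
    AMBER = "needs a closer look"
    RED = "declined to mark"   # colour choice for this state is an assumption


CONFIDENT_AT = 0.90  # hypothetical threshold, not taken from the case study


def traffic_light(confidence: Optional[float]) -> Light:
    """Map a confidence score in [0, 1] to a light; None means the system declined."""
    if confidence is None:
        return Light.RED
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence out of range: {confidence}")
    return Light.GREEN if confidence >= CONFIDENT_AT else Light.AMBER


# Hypothetical per-criterion confidences for one report.
marks = {"criterion_01": 0.97, "criterion_02": 0.62, "criterion_03": None}
for name, confidence in marks.items():
    print(f"{name}: {traffic_light(confidence).value}")
```

Whatever the thresholds, every state remains a prompt for the assessor, who keeps the final say.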

The handover split cleanly in two: a strategy report for the board, covering market analysis, process mapping, model evaluation, hosting assessment, success metrics, and the implementation roadmap; and a technical specification for the build team, covering architecture, AI requirements, security framework, UI specifications, testing strategy, and deployment procedures. Alongside both, the package carried UI mockups, an API questionnaire for the third-party platform, and a development quote so the board could commission the build the day after sign-off.

  • Six-model benchmark framework with 150 evaluation scenarios across six languages.
  • Per-criterion accuracy reporting plus strict-accuracy aggregation across all 17 criteria.
  • Seven-provider hosting analysis with sovereignty, cost, and self-hosted alternatives.
  • Traffic-light confidence UI mockups for the assessor dashboard.
  • API questionnaire and fallback approaches for the closed third-party marking platform.
  • Continuous-verification framework so the system re-runs against pre-assessed baselines.
Benchmark

Per-criterion accuracy across the six supported languages.

Six frontier AI models tested against representative reports in English, Spanish, German, French, Italian, and Portuguese. Per-criterion accuracy reaches 93-96%; strict accuracy (all 17 criteria correct on the same report) drops to 45%, which is what validated the human-in-the-loop requirement.

  1. Strict accuracy (all 17 criteria): 45%. Why human oversight stays in the loop.
  2. German, French, Italian, Portuguese: 93-95%. Compound technical terms and diacritics.
  3. English and Spanish: 99-100%. Highest-confidence languages.

Accuracy (higher is better)

04 Outcomes

A board-ready investment decision with the evidence to defend it.

Human oversight validated

Per-criterion 93-96% with strict accuracy at 45% put numbers behind the human-in-the-loop requirement. AI assists the assessor; the assessor signs off.

Licensing constraints written in

Meta's Llama 4 was marginally the most accurate model but cannot be deployed by an EU-based organisation under its current licensing terms. The recommendation went to a model an EU-resident UKAS-accredited service can actually run.

Multilingual strength documented

English and Spanish at 99-100%, the other four supported languages at 93-95%. The board saw the specific accuracy curve before any build started.

Third-party integration de-risked

The closed marking platform has no public API. The spec carries an API questionnaire and named fallback approaches before integration uncertainty becomes integration delay.

Stakeholder programme

Five interviews across the assessment business. No model touched any data until the people had been heard.

CEO

Simon Patton

Commercial frame and the volunteer-assessor mission. What "AI assists, not replaces" actually had to mean inside an exam board.

Assessment Lead

Scheme design

How the seventeen clerical criteria are defined, scored, and reconciled when two assessors disagree on the same report.

Scientific Team

Domain perspective

Where assessor fatigue actually hits, which criteria are subjective vs rule-based, and why the boring half is the half worth automating first.

IT Head

Continuous verification

One-time benchmarking is insufficient for healthcare AI. The system must re-run against pre-assessed baselines on a defined schedule.

IT Project Manager

Third-party integration

Existing marking platform has no public API documentation. Risk register and API questionnaire scoped from this conversation.

Findings the benchmark surfaced

Four observations that shaped the recommendation, alongside the headline accuracy curve.

Human-in-the-loop essential

Per-criterion accuracy at 93-96%; strict accuracy across all 17 criteria at 45%. The gap put numbers behind the assessor-stays-in-the-loop policy.

Licensing eliminates the top model

Meta's Llama 4 was marginally more accurate but cannot be deployed by an EU-based organisation under current licensing. Recommendation went to the next-best model EMQN can actually run.

Multilingual performance is strong

English and Spanish at 99-100%, German, French, Italian, and Portuguese at 93-95%. The minor variations are compound technical terms and diacritical marks.

Continuous verification, not one-time

EMQN's IT Head was emphatic: any healthcare AI system must support ongoing verification against pre-assessed baselines. The recommendation builds that in as a first-class operation.
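One way such a verification run could look is sketched below: re-mark a set of reports that assessors have already scored, then compare the agreement rate with a floor. The `mark_report` callable stands in for whatever model call the built system uses, and the 0.93 floor is a placeholder; both are assumptions for illustration.

```python
from typing import Callable, Mapping, Sequence

Marks = Sequence[bool]  # one pass/fail mark per criterion


def verify_against_baseline(
    baseline: Mapping[str, Marks],
    mark_report: Callable[[str], Marks],
    min_agreement: float = 0.93,  # placeholder floor, not a figure from the spec
) -> tuple[float, bool]:
    """Re-mark every baseline report and return (agreement rate, passed)."""
    agree = total = 0
    for report_id, expected in baseline.items():
        produced = mark_report(report_id)
        if len(produced) != len(expected):
            raise ValueError(f"{report_id}: criterion count mismatch")
        agree += sum(p == e for p, e in zip(produced, expected))
        total += len(expected)
    agreement = agree / total
    return agreement, agreement >= min_agreement


# Toy baseline of two pre-assessed reports with three criteria each.
baseline = {"r1": [True, True, False], "r2": [True, False, True]}
fake_model_output = {"r1": [True, True, False], "r2": [True, True, True]}

agreement, passed = verify_against_baseline(baseline, fake_model_output.__getitem__)
print(f"agreement={agreement:.3f} passed={passed}")
```

Run on a schedule, a check of this shape would flag a drop in agreement after a model or prompt change before assessors see the results.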

Documentation handover

A board-ready decision package and a build-ready specification, with the artefacts to commission either one.

For the board

Strategy report

Market analysis, process mapping, model evaluation, hosting assessment, success metrics, implementation roadmap.

For the build team

Technical specification

Architecture, AI requirements, security framework, UI specifications, testing strategy, deployment procedures.

150 runs

Model benchmark dossier

Six frontier models, fifteen representative reports, six languages, ten runs per model. Per-criterion accuracy and licensing notes.

7 providers

Hosting analysis

Sovereignty assessment with CLOUD Act exposure, GDPR compliance, operational cost modelling, and a self-hosted alternative.
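The screening logic behind an analysis like this can be sketched as a filter over the hard constraints followed by a cost ranking. Every provider name, region, and cost below is a made-up placeholder, not a result from the actual seven-provider analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    data_region: str          # where customer data is stored
    cloud_act_exposure: bool  # True if any tier of the stack is within CLOUD Act reach
    gdpr_compliant: bool
    monthly_cost_gbp: float   # placeholder figure


ALLOWED_REGIONS = {"UK", "EU"}


def eligible(providers: list[Provider]) -> list[Provider]:
    """Keep providers that meet every hard constraint, cheapest first."""
    ok = [
        p for p in providers
        if p.data_region in ALLOWED_REGIONS
        and not p.cloud_act_exposure
        and p.gdpr_compliant
    ]
    return sorted(ok, key=lambda p: p.monthly_cost_gbp)


# Placeholder candidates for illustration only.
candidates = [
    Provider("Provider A", "EU", False, True, 900.0),
    Provider("Provider B", "EU", True, True, 600.0),
    Provider("Provider C", "UK", False, True, 750.0),
    Provider("Provider D", "US", True, False, 500.0),
]
for p in eligible(candidates):
    print(p.name, p.monthly_cost_gbp)
```

Treating sovereignty as a filter and cost as a ranking means a cheaper provider can never win by trading away a compliance constraint.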

UI mockups

Traffic-light assessor dashboard

Modern dashboard design with confidence colour-coding, PDF viewer integration, and batch processing workflows for volunteer assessors.

API questionnaire

Integration risk register

Requirements documented for the closed third-party platform, with named fallback approaches so integration uncertainty stops being a delivery risk.

In their words

Working with OpenKit has been a genuinely positive experience. Their team quickly understood the unique challenges of our business and the problem we were trying to solve, and delivered a thorough, evidence-based strategy for our AI-assisted marking platform. We were particularly impressed by their transparent approach, technical expertise, and commitment to long-term partnership and support. I would recommend OpenKit to any organisation seeking a reliable, innovative technology partner.

Simon Patton, CEO · EMQN CIC
Approach

How we delivered it.

Stack

AI model benchmarking framework · Multilingual evaluation suite · EU-sovereign hosting analysis · Traffic-light confidence UI design · Third-party API risk register

Capabilities

AI Consulting · Strategy · Healthcare

Compliance

UKAS accreditation · ISO 27001 · ISO 9001 · GDPR · No US CLOUD Act exposure
Engagement

From scoping to live.

  1. Week 1: Stakeholder engagement. Five in-depth interviews: CEO, Assessment Lead, Scientific Team, IT Head, IT Project Manager. Processes understood from multiple perspectives before any model touched any data.
  2. Week 2: Model benchmarking. Six frontier AI models evaluated across 150 scenarios using 15 representative reports in all six supported languages, with ten independent runs per model.
  3. Week 3: Hosting and sovereignty. Seven cloud providers analysed for EU data sovereignty: CLOUD Act exposure, GDPR compliance, operational cost, and self-hosted alternatives.
  4. Week 4: Documentation handover. A strategy report for the board and a technical specification for the build team, alongside UI mockups, an API questionnaire for the third-party platform, and a development quote.
