EMQN AI Discovery Consultancy
Four-week discovery sprint designing an AI-assisted marking platform for genetic testing laboratories, achieving 93-96% per-criterion accuracy across six languages.
An exam board for genetic testing labs, with seventeen rule-based criteria assessors hand-check on every report.
EMQN is a Manchester-based Community Interest Company that functions as an exam board for laboratories. They run External Quality Assessment schemes for human genetic testing across the globe, processing roughly 20,000 laboratory reports a year across 50+ schemes in six languages. Expert volunteer assessors mark every report against standardised criteria, with at least two assessors per report and discordant results reconciled in moderation meetings.
The clerical accuracy component consists of 17 objective, rule-based criteria that assessors hand-check on every report. Assessors described it as "the most boring and laborious part" of their work, even though it is the least scientifically complex. It takes three to five minutes per report, multiplied across 20,000 reports and across volunteers donating their time.
- UKAS accreditation and ISO 27001 in force: data must stay inside the UK / EU.
- No US CLOUD Act exposure for any tier of the hosting stack.
- Volunteer assessors keep final say; AI assists, never replaces.
- Six supported languages must score consistently on the same criteria.
- The third-party marking platform has no public API, so the spec carries fallback approaches for integration.
Evidence-based discovery against EMQN's actual reports, in their actual languages, with the licensing constraints written in.
The sprint was structured to expose variance, not hide it. Six frontier AI models were tested across 150 scenarios using 15 representative reports in all six supported languages, with ten independent runs per model. Per-criterion accuracy hit 93-96% on 14 of the 17 criteria; strict accuracy across all seventeen dropped to 45%. That gap is exactly what validated the human-in-the-loop requirement: assist the assessor, do not automate the assessor away.
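The size of that gap is roughly what compounding predicts. As a back-of-envelope sketch, assume every criterion is marked correctly at the same rate and that errors are independent across criteria. Neither assumption comes from the benchmark itself, and the function names are illustrative. Under those assumptions, strict accuracy is the per-criterion rate raised to the 17th power:

```python
# Illustrative only: assumes all 17 criteria share one accuracy rate and
# that errors are independent. Real criteria will differ and correlate.
N_CRITERIA = 17


def strict_from_uniform_rate(per_criterion: float, n_criteria: int = N_CRITERIA) -> float:
    """Probability that every criterion on one report is marked correctly."""
    return per_criterion ** n_criteria


def implied_uniform_rate(strict: float, n_criteria: int = N_CRITERIA) -> float:
    """Uniform per-criterion rate that would produce the given strict rate."""
    return strict ** (1.0 / n_criteria)


if __name__ == "__main__":
    for rate in (0.93, 0.95, 0.96):
        print(f"{rate:.0%} per criterion -> {strict_from_uniform_rate(rate):.1%} strict")
    print(f"45% strict implies about {implied_uniform_rate(0.45):.1%} per criterion")
```

Under these simplifying assumptions, per-criterion rates of 93% to 96% give strict accuracy of roughly 29% to 50%, a range that brackets the observed 45%.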
One uncomfortable finding shaped the model recommendation: Meta's Llama 4 was marginally the most accurate model but could not be deployed by an EU-based organisation under current licensing terms. The recommendation went to the next-best model that an EU-resident UKAS-accredited service could actually run.
Seven cloud providers were analysed for CLOUD Act exposure, GDPR compliance, and operational cost. The traffic-light confidence UI was designed so an assessor can see at a glance which criteria the AI is confident on, which need a closer look, and which the system declined to mark.
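The three dashboard states can be sketched as a small mapping from a per-criterion confidence score to a colour. This is a minimal illustration, not the specified design: the threshold value, the colour assigned to each state, and the names `Light`, `CONFIDENT_AT`, and `traffic_light` are all assumptions.

```python
from enum import Enum
from typing import Optional


class Light(Enum):
    GREEN = "confident"
    AMBER = "needs a closer look"
    GREY = "declined to mark"


# Hypothetical cut-off; the real threshold is not stated in the case study.
CONFIDENT_AT = 0.90


def traffic_light(confidence: Optional[float]) -> Light:
    """Map one criterion's confidence score to a dashboard state.

    None means the system declined to mark the criterion.
    """
    if confidence is None:
        return Light.GREY
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    return Light.GREEN if confidence >= CONFIDENT_AT else Light.AMBER


if __name__ == "__main__":
    for score in (0.97, 0.62, None):
        print(score, "->", traffic_light(score).value)
```

Keeping "declined" as its own state, rather than folding it into a low score, keeps the assessor's view honest: a criterion the system would not mark is different from one it marked with low confidence.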
The handover split cleanly in two: a strategy report for the board, covering market analysis, process mapping, model evaluation, hosting assessment, success metrics, and the implementation roadmap; and a technical specification for the build team, covering architecture, AI requirements, security framework, UI specifications, testing strategy, and deployment procedures. Alongside both, the package carried UI mockups, an API questionnaire for the third-party platform, and a development quote so the board could commission the build the day after sign-off.
- Six-model benchmark framework with 150 evaluation scenarios across six languages.
- Per-criterion accuracy reporting plus strict-accuracy aggregation across all 17 criteria.
- Seven-provider hosting analysis with sovereignty, cost, and self-hosted alternatives.
- Traffic-light confidence UI mockups for the assessor dashboard.
- API questionnaire and fallback approaches for the closed third-party marking platform.
- Continuous-verification framework so the system re-runs against pre-assessed baselines.
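The two accuracy views in the list above can be computed from the same grid of results. The sketch below assumes a hypothetical layout, one row per report and one boolean per criterion recording whether the AI mark matched the assessor mark, and uses made-up toy data rather than benchmark results.

```python
from typing import List


def per_criterion_accuracy(results: List[List[bool]]) -> List[float]:
    """Share of reports marked correctly, computed separately for each criterion."""
    n_reports = len(results)
    n_criteria = len(results[0])
    return [sum(row[c] for row in results) / n_reports for c in range(n_criteria)]


def strict_report_accuracy(results: List[List[bool]]) -> float:
    """Share of reports on which every criterion was marked correctly."""
    return sum(all(row) for row in results) / len(results)


if __name__ == "__main__":
    # Toy data: 4 reports x 3 criteria (the real scheme uses 17 criteria).
    toy = [
        [True, True, True],
        [True, True, False],
        [True, False, True],
        [False, True, True],
    ]
    print(per_criterion_accuracy(toy))   # each criterion is right on 3 of 4 reports
    print(strict_report_accuracy(toy))   # only 1 of 4 reports is fully correct
```

Even in this toy grid, every criterion scores 75% while only a quarter of reports are fully correct, which is why the two figures are reported side by side.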
Per-criterion accuracy across the six supported languages.
Six frontier AI models tested against representative reports in English, Spanish, German, French, Italian, and Portuguese. Per-criterion accuracy reaches 93-96%; strict accuracy (all 17 criteria correct on the same report) drops to 45%, which is what validated the human-in-the-loop requirement.
[Chart: accuracy by language; axis: accuracy, higher is better]
A board-ready investment decision with the evidence to defend it.
Human oversight validated
Per-criterion 93-96% with strict accuracy at 45% put numbers behind the human-in-the-loop requirement. AI assists the assessor; the assessor signs off.
Licensing constraints written in
Meta's Llama 4 was marginally the best model but cannot be deployed by an EU-based organisation under its current licensing terms. The recommendation went to a model an EU-resident UKAS-accredited service can actually run.
Multilingual strength documented
English and Spanish at 99-100%, the other four supported languages at 93-95%. The board saw the specific accuracy curve before any build started.
Third-party integration de-risked
The closed marking platform has no public API. The spec carries an API questionnaire and named fallback approaches before integration uncertainty becomes integration delay.
Five interviews across the assessment business. No model touched any data until the people had been heard.
Simon Patton
Commercial frame and the volunteer-assessor mission. What "AI assists, not replaces" actually had to mean inside an exam board.
Scheme design
How the seventeen clerical criteria are defined, scored, and reconciled when two assessors disagree on the same report.
Domain perspective
Where assessor fatigue actually hits, which criteria are subjective vs rule-based, and why the boring half is the half worth automating first.
Continuous verification
One-time benchmarking is insufficient for healthcare AI. The system must re-run against pre-assessed baselines on a defined schedule.
Third-party integration
Existing marking platform has no public API documentation. Risk register and API questionnaire scoped from this conversation.
Four observations that shaped the recommendation, alongside the headline accuracy curve.
Human-in-the-loop essential
Per-criterion accuracy at 93-96%; strict accuracy across all 17 criteria at 45%. The gap put numbers behind the assessor-stays-in-the-loop policy.
Licensing eliminates the top model
Meta's Llama 4 was marginally more accurate but cannot be deployed by an EU-based organisation under current licensing. Recommendation went to the next-best model EMQN can actually run.
Multilingual performance is strong
English and Spanish at 99-100%; German, French, Italian, and Portuguese at 93-95%. The minor variations trace to compound technical terms and diacritical marks.
Continuous verification, not one-time
EMQN's IT Head was emphatic: any healthcare AI system must support ongoing verification against pre-assessed baselines. The recommendation builds that in as a first-class operation.
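One way to picture that operation is a scheduled job that re-marks a fixed set of pre-assessed reports and flags any criterion whose agreement with the baseline has slipped. The sketch below is an assumption-laden illustration: the data layout, the criterion names, the 0.95 default, and the function `drifted_criteria` are hypothetical and not taken from the specification.

```python
from typing import Dict, List


def drifted_criteria(
    baseline: Dict[str, List[bool]],
    rerun: Dict[str, List[bool]],
    min_agreement: float = 0.95,  # hypothetical tolerance
) -> List[str]:
    """Return criteria whose re-run marks agree with the pre-assessed
    baseline on fewer than min_agreement of the baseline reports."""
    flagged = []
    for criterion, expected in baseline.items():
        actual = rerun[criterion]
        if not expected or len(actual) != len(expected):
            raise ValueError(f"mismatched or empty marks for {criterion!r}")
        agreement = sum(a == e for a, e in zip(actual, expected)) / len(expected)
        if agreement < min_agreement:
            flagged.append(criterion)
    return flagged


if __name__ == "__main__":
    # Toy marks for four baseline reports; criterion names are made up.
    baseline = {"patient_name": [True, True, False, True],
                "report_date": [True, False, True, True]}
    rerun = {"patient_name": [True, True, False, True],
             "report_date": [True, True, True, True]}
    print(drifted_criteria(baseline, rerun, min_agreement=0.9))
```

In the toy run, `report_date` agrees with the baseline on three of four reports and is flagged, while `patient_name` matches on all four and passes.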
A board-ready decision package and a build-ready specification, with the artefacts to commission either one.
Strategy report
Market analysis, process mapping, model evaluation, hosting assessment, success metrics, implementation roadmap.
Technical specification
Architecture, AI requirements, security framework, UI specifications, testing strategy, deployment procedures.
Model benchmark dossier
Six frontier models, fifteen representative reports, six languages, ten runs per model. Per-criterion accuracy and licensing notes.
Hosting analysis
Sovereignty assessment with CLOUD Act exposure, GDPR compliance, operational cost modelling, and a self-hosted alternative.
Traffic-light assessor dashboard
Modern dashboard design with confidence colour-coding, PDF viewer integration, and batch processing workflows for volunteer assessors.
Integration risk register
Requirements documented for the closed third-party platform, with named fallback approaches so integration uncertainty stops being a delivery risk.
Working with OpenKit has been a genuinely positive experience. Their team quickly understood the unique challenges of our business and the problem we were trying to solve, and delivered a thorough, evidence-based strategy for our AI-assisted marking platform. We were particularly impressed by their transparent approach, technical expertise, and commitment to long-term partnership and support. I would recommend OpenKit to any organisation seeking a reliable, innovative technology partner.
Simon Patton, CEO · EMQN CIC
How we delivered it.
From scoping to live.
- Stakeholder engagement (Week 1). Five in-depth interviews: CEO, Assessment Lead, Scientific Team, IT Head, IT Project Manager. Processes understood from multiple perspectives before any model touched any data.
- Model benchmarking (Week 2). Six frontier AI models evaluated across 150 scenarios using 15 representative reports in all six supported languages, with ten independent runs per model.
- Hosting and sovereignty (Week 3). Seven cloud providers analysed for EU data sovereignty: CLOUD Act exposure, GDPR compliance, operational cost, and self-hosted alternatives.
- Documentation handover (Week 4). A strategy report for the board and a technical specification for the build team, alongside UI mockups, an API questionnaire for the third-party platform, and a development quote.
