OMOP
Central Position
The most important point is that OHDSI/OMOP should be separated into three different things:
- Open-source tools for data management and data analysis. This is clearly good. The tools can and should be evaluated. But data management is often highly specific to local data, and good reproducible data management often means the final analysis can be a short, conventional model command in R, Stata, or Python. Useful tools do not imply that the whole OMOP stack should be adopted.
- Federated analyses across multiple territories. This is directionally good and worth participating in once the core OpenSAFELY work is shipped. It should not be the immediate priority. Very few territories have data as rich as OpenSAFELY’s source data, and OMOP-style common data models tend to reduce datasets to a lowest common denominator. That can leave scant information per participant.
- Using lots of analytic approaches and pooling the outputs. This is interesting methodologically, but it is not the near-term mission. The priority is straightforward, robust analyses with face validity that can rapidly and credibly inform decision makers and the wider community. Trialling novel analytic approaches is secondary.
Nothing in the later material clearly contradicts this position. The later discussions mostly reinforce it: OHDSI has useful tools and a serious ecosystem, but OMOPification is expensive, lossy, and often over-sold as if it removes hard data problems rather than moving them into mapping, maintenance, and hidden judgement calls.
Short Version
OMOP is not useless, but it is over-sold. It is useful as an ecosystem to understand, a set of tools to evaluate, and a possible interoperability target. It should not become the default modelling philosophy for OpenSAFELY.
The right posture is: learn from OHDSI, reuse useful tools and ideas, and support OMOP where there is a concrete use-case. Do not let external excitement about OMOP turn into open-ended OMOPification work without a clear research purpose, maintenance plan, and funding.
1. What Is Worth Liking
OHDSI’s open-source culture is a real positive. Data-management tooling, analysis packages, data-quality checks, cohort-builder concepts, and documentation are all worth reviewing. There is no reason to reject useful work just because it comes from the OMOP ecosystem.
ATLAS may be particularly worth understanding because GUI-based cohort definition is a genuine user-interface advantage if it works well. HADES, PatientLevelPrediction, IncidencePrevalence, and DataQualityDashboard may also contain useful ideas, but their actual uptake, usability, and reliability need to be checked rather than assumed.
Federated analysis is also attractive in principle. Cross-territory analysis can answer questions that single-country or single-dataset work cannot. But federation only pays off when the participating datasets are good enough, the common definitions are meaningful, and the analysis is not reduced to something too thin to be useful.
2. Why OMOP Should Not Become The Default Strategy
OMOP is often treated as a magic solution to hard data problems. That is wrong. Mapping heterogeneous EHR data into a common model involves many judgement calls, compromises, local-code issues, inferred fields, granularity loss, and ongoing maintenance work.
If those decisions are made by suppliers or consultants and hidden inside an ETL process, downstream researchers get a cleaner interface but lose visibility into the assumptions that matter. That is a poor fit for OpenSAFELY’s strengths: explicit code, visible curation, composable definitions, and inspectable implementation choices.
Common data models also tend to flatten rich source data. That may be an acceptable trade-off for some international comparisons, but it is not automatically a good trade-off for English primary care or other rich local datasets. Where OpenSAFELY has better source data than many other territories, reducing it to the lowest common denominator can destroy value.
3. The Best Interoperability Story
The best “we do OMOP” story is probably not “convert all OpenSAFELY data into OMOP”. It is “ehrQL can query OMOP-compliant databases where that is useful”.
An OMOP backend for ehrQL would let OpenSAFELY participate in OMOP contexts without making OMOP the native model. Where a database is already OMOP-compliant, ehrQL could extract research-ready datasets in the usual way. This is a pragmatic interoperability route.
A strong variables library would make this much more compelling. Reusable definitions for things like ethnicity, QOF variables, or other standard constructs could run across OpenSAFELY-native backends and OMOP backends. That would be a stronger contribution than simply exposing another OMOP database.
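A minimal sketch of the backend-portable idea, written in plain Python rather than real ehrQL. Every name here (`Backend`, `NativeBackend`, `OmopBackend`, `has_any_code`, and the example codes and concept IDs) is hypothetical and invented for illustration. None of it reflects ehrQL's actual API or real OMOP vocabulary content.

```python
from dataclasses import dataclass
from typing import Protocol


class Backend(Protocol):
    """Hypothetical interface a backend exposes to variable definitions."""

    def codes_for_patient(self, patient_id: int) -> set[str]:
        """Return the clinical codes recorded for a patient, in a shared coding system."""
        ...


@dataclass
class NativeBackend:
    # patient_id -> codes already stored in the shared coding system
    events: dict[int, set[str]]

    def codes_for_patient(self, patient_id: int) -> set[str]:
        return self.events.get(patient_id, set())


@dataclass
class OmopBackend:
    # patient_id -> OMOP-style concept IDs (made-up values)
    condition_concepts: dict[int, set[int]]
    # concept ID -> shared code; the mapping is visible here, not hidden in ETL
    concept_to_code: dict[int, str]

    def codes_for_patient(self, patient_id: int) -> set[str]:
        concepts = self.condition_concepts.get(patient_id, set())
        return {self.concept_to_code[c] for c in concepts if c in self.concept_to_code}


def has_any_code(backend: Backend, patient_id: int, codelist: set[str]) -> bool:
    """A reusable definition: written once, run against any backend."""
    return bool(backend.codes_for_patient(patient_id) & codelist)


DIABETES_CODELIST = {"DM1", "DM2"}  # placeholder codes, not a real codelist

native = NativeBackend(events={1: {"DM2"}, 2: {"ASTHMA"}})
omop = OmopBackend(
    condition_concepts={1: {9001}, 2: {9002}},
    concept_to_code={9001: "DM2", 9002: "ASTHMA"},
)

for backend in (native, omop):
    results = [has_any_code(backend, p, DIABETES_CODELIST) for p in (1, 2)]
    print(type(backend).__name__, results)
```

The point of the design is that `has_any_code` never learns which backend it is talking to. The OMOP-specific translation sits in one inspectable place rather than inside each study definition.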
There may also be value in OMOPifying actual source GP data using OpenSAFELY tools at some point. But done seriously, this would quickly expose hard edges between OMOP and event-level EHR data. Those mismatches would need documenting or fixing rather than glossing over.
4. What OMOP Hides
The core risk is hidden complexity. OMOP mappings can hide:
- assumptions about how source events map to OMOP concepts;
- loss of granularity;
- inferred visits or event structures;
- missing vendor-specific local codes;
- wrong or surprising code mappings;
- local data-collection context;
- ETL decisions that are not easy to publish or inspect;
- maintenance gaps when mappings become stale.
This matters because the mapped data can look standardised even when the underlying decisions are contestable. A standard schema does not make the judgement calls disappear.
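One way to keep those judgement calls visible is to have the mapping step report what it dropped or collapsed, rather than emitting only the tidy output. The sketch below is illustrative: the function name `map_with_audit`, the local codes, and the concept IDs are all made up, and a real ETL would be far more involved.

```python
from collections import defaultdict


def map_with_audit(source_codes, mapping):
    """Map source codes to target concepts and report what was lost on the way.

    source_codes: iterable of local source codes (hypothetical values).
    mapping: dict of source code -> target concept ID (hypothetical values).
    Returns (mapped, audit). The audit lists unmapped codes and every target
    concept that absorbed more than one distinct source code.
    """
    mapped = []
    unmapped = set()
    sources_by_concept = defaultdict(set)

    for code in source_codes:
        concept = mapping.get(code)
        if concept is None:
            unmapped.add(code)  # recorded here instead of silently dropped
            continue
        mapped.append(concept)
        sources_by_concept[concept].add(code)

    collapsed = {
        concept: sorted(codes)
        for concept, codes in sources_by_concept.items()
        if len(codes) > 1  # several distinct source codes became one concept
    }
    audit = {"unmapped": sorted(unmapped), "collapsed": collapsed}
    return mapped, audit


mapping = {"BP_SITTING": 100, "BP_STANDING": 100, "HBA1C": 200}
events = ["BP_SITTING", "BP_STANDING", "HBA1C", "LOCAL_XYZ"]

mapped, audit = map_with_audit(events, mapping)
print(mapped)
print(audit)
```

In this toy run the mapped output looks clean, while the audit shows that one local code vanished and two distinct measurements became indistinguishable.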
5. Maintenance Is The Hard Part
OMOPification is not a one-off transformation. A static mapping is a liability. Source schemas, vocabularies, local codes, and research needs change, so mappings need continuing ownership, documentation, testing, and quality control.
The UK Biobank example is an important warning. An OMOP derivation dataset was restricted because the mapping was static and not maintained. Reported problems included missing local codes, wrong mappings, multiple deaths for some participants, and limited ability to explain the mapping because of proprietary conversion work.
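Problems such as multiple deaths per participant are cheap to test for once the mapped data is in hand. A minimal sketch, assuming death records arrive as (participant ID, date) pairs. The record layout and the function name are assumptions for illustration, not the UK Biobank or OMOP schema.

```python
from collections import Counter


def participants_with_multiple_deaths(death_records):
    """Return sorted participant IDs that have more than one death record."""
    counts = Counter(participant_id for participant_id, _date in death_records)
    return sorted(pid for pid, n in counts.items() if n > 1)


records = [
    (101, "2020-03-01"),
    (102, "2021-07-15"),
    (101, "2020-03-04"),  # second death record for the same participant
]

print(participants_with_multiple_deaths(records))
```

A check like this only has value if someone owns it and reruns it whenever the mapping or the source data changes.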
The genomics/cancer-record examples point in the same direction. OMOP work can force an organisation to catalogue its data properly, which is a real benefit. But it can also require inferred visits, loss of granularity, and hidden ETL decisions. The value of the investment is not proven until real users can use the mapped data successfully.
6. Data Quality And Contracts
OHDSI’s data-quality work is mature prior art. The Kahn-style terminology and the OHDSI DataQualityDashboard checks are worth understanding, especially those covering conformance, completeness, plausibility, and incompatible distributions.
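To make the categories concrete, here is a small sketch of three check types in the Kahn-style vocabulary: conformance, completeness, and plausibility. These are hand-written illustrations, not the DataQualityDashboard's own checks or API. The field names, allowed values, and the 120-year threshold are assumptions.

```python
from datetime import date

ALLOWED_STATUS_VALUES = {"registered", "deregistered"}  # assumed value set


def conformance_failures(rows):
    """Rows whose 'status' value is outside the allowed value set."""
    return [r for r in rows if r.get("status") not in ALLOWED_STATUS_VALUES]


def completeness(rows, field):
    """Fraction of rows where the field is present and not None."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(field) is not None)
    return filled / len(rows)


def plausibility_failures(rows, today):
    """Rows with a birth date in the future or more than 120 years ago."""
    failures = []
    for r in rows:
        dob = r.get("date_of_birth")
        if dob is None:
            continue  # missingness is a completeness issue, not a plausibility one
        age_years = (today - dob).days / 365.25
        if age_years < 0 or age_years > 120:
            failures.append(r)
    return failures


rows = [
    {"id": 1, "status": "registered", "date_of_birth": date(1980, 5, 1)},
    {"id": 2, "status": "REG", "date_of_birth": date(1875, 1, 1)},
    {"id": 3, "status": "deregistered", "date_of_birth": None},
]
today = date(2024, 1, 1)

print([r["id"] for r in conformance_failures(rows)])
print(completeness(rows, "date_of_birth"))
print([r["id"] for r in plausibility_failures(rows, today)])
```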
The data-quality activities needed by OpenSAFELY and OHDSI overlap substantially. The modelling philosophy differs.
OMOP is essentially “one schema for everyone”. OpenSAFELY contracts are closer to “one size will not fit all, but here are the minimal expectations and useful extensions”. That is a better fit when data sources differ and when local context matters.
For OpenSAFELY contracts, the preferred direction is the smallest required field set, with optional data available where backends can provide it. Researchers should be able to see what each backend supports. Extra fields can live in extensions or separate tables. Making backend completeness visible could also incentivise suppliers to improve coverage.
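A rough sketch of what "smallest required field set plus visible optional extensions" could look like in code. The `Contract` class, the field names, and the backend names are hypothetical and are not taken from OpenSAFELY's actual contract definitions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Contract:
    """Hypothetical table contract: a minimal required core plus optional extras."""

    name: str
    required: frozenset[str]
    optional: frozenset[str] = field(default_factory=frozenset)

    def check(self, provided: set[str]) -> dict:
        """Report whether a backend meets the contract and which extras it offers."""
        missing = self.required - provided
        return {
            "compliant": not missing,
            "missing_required": sorted(missing),
            "optional_supported": sorted(self.optional & provided),
            "optional_unsupported": sorted(self.optional - provided),
        }


clinical_events = Contract(
    name="clinical_events",
    required=frozenset({"patient_id", "date", "code"}),
    optional=frozenset({"numeric_value", "consultation_id"}),
)

# Made-up backends and the columns each one can provide.
backends = {
    "backend_a": {"patient_id", "date", "code", "numeric_value"},
    "backend_b": {"patient_id", "date", "code"},
    "backend_c": {"patient_id", "code"},
}

# A support matrix researchers could read before writing a study.
for backend_name, columns in backends.items():
    print(backend_name, clinical_events.check(columns))
```

Publishing a matrix like this is one way to make backend completeness visible: a backend can be compliant with the core while plainly showing which optional fields it lacks.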
7. Modelling Differences Matter
OMOP is SQL-oriented and relatively normalised. OpenSAFELY tends towards constrained, researcher-facing structures that are sometimes more denormalised. That constraint is a benefit: it reduces the number of ways researchers can do the wrong thing.
OMOP’s treatment of sex/gender is a useful example of why OMOP should inform, not govern, OpenSAFELY modelling choices. Some OMOP decisions may be reasonable for particular analytic purposes but still be poor standards to inherit blindly.
Patient identifiers, registration tables, practice data, clinical events, inpatient spells, and optional fields all show the same issue: the hard work is not choosing a standard name. The hard work is deciding what a field means, which fields are essential, which are optional, and how much variation to allow between backends.
8. External Pressure Is Real But Not A Strategy
National bodies, HDR-linked programmes, regulators, SDEs, and external audiences ask about OMOP because it is visible in policy and real-world evidence circles. That does not mean OpenSAFELY should accept OMOPification as the strategic direction.
Some HDR-linked work appears to be pushing broad OMOPification, including English primary care data. That looks expensive and weakly prioritised if it means “OMOP everything” rather than “OMOP the bits needed for a concrete use-case, then evaluate whether it helped”.
There is frustration that previous harmonisation efforts received large resources but left little public trace of outputs, failures, or lessons. More OMOP work should not proceed as if those lessons do not matter.
The right response to external pressure is not obstruction. It is clarity: OMOP can be useful, but only with a clear purpose, realistic costing, visible mapping decisions, and ongoing maintenance.
9. Regulator, SDE, And Commercial Questions
OMOP questions often arrive bundled with other practical questions: broad approvals, rapid surveillance, private companies, unpublished outputs, deployment into SDEs, and how OpenSAFELY fits with national or local infrastructure.
OpenSAFELY needs concise answers on these. Redeployment is not just installing software. Depending on data similarity, it likely needs a small skilled group for several months, followed by ongoing support. A rough minimum estimate in the source material was four people for three to six months, and at least GBP 500k per year to keep the capability alive.
For sensitive pre-publication work, private code and careful project/action naming may be needed because summaries and some artefacts can be visible. This is not an OMOP issue directly, but it matters in the same regulator-facing real-world evidence context where OMOP questions arise.
10. Standards Language Can Obscure Delivery
“Federation”, “CDM”, “TRE standards”, and similar language can sound precise while hiding unclear delivery models. Some diagrams and proposals were hard to interpret even after discussion.
The useful question is always practical: what data moves, what code runs, where are the assumptions documented, who maintains the mapping, what does the researcher see, and what concrete analysis becomes possible?
Practical Line To Use
OMOP can be useful for cross-dataset and cross-country work, and OHDSI has tools worth evaluating. But mapping to OMOP is expensive and makes important judgement calls. OpenSAFELY’s core value is making those decisions explicit in code and reusable definitions. Where there is a concrete use-case, an OMOP backend for ehrQL is a sensible interoperability route. Open-ended OMOPification is not.
Open Questions
- Which OHDSI/HADES tools are actually used in published studies, and how often?
- Are HADES, ATLAS, PatientLevelPrediction, IncidencePrevalence, and DataQualityDashboard genuinely reusable outside their home ecosystem?
- What would be the smallest valuable OMOP backend for ehrQL?
- Which OpenSAFELY variables would prove the value of backend-portable definitions?
- What is the real cost of maintaining a high-quality OMOP mapping over time?
- Where is OMOP genuinely necessary: international studies, regulator-facing workflows, specific SDE requirements, or only particular datasets?
- What public lessons exist from past UK harmonisation programmes, and are they being used?
External Resources
- HDR data partner funding call: https://www.hdruk.ac.uk/funding-opportunities/cff-for-data-partners-to-join-a-uk-real-world-evidence-network/
- HDR real-world evidence coordination centre call: https://www.hdruk.ac.uk/news/call-for-funding-to-establish-a-real-world-evidence-network-co-ordination-centre-for-the-uk/
- OHDSI PatientLevelPrediction: https://ohdsi.github.io/PatientLevelPrediction/
- DARWIN EU IncidencePrevalence: https://darwin-eu.github.io/IncidencePrevalence/
- PatientLevelPrediction custom cohorts vignette: https://ohdsi.github.io/PatientLevelPrediction/articles/BuildingPredictiveModels.html#custom-cohorts
- UK Biobank OMOP derivation dataset restriction: https://community.ukbiobank.ac.uk/hc/en-gb/articles/24332548208413-Restriction-of-OMOP-derivation-dataset
- OHDSI Common Data Model overview: https://www.ohdsi.org/data-standardization/the-common-data-model/
- OHDSI Common Data Model docs: https://ohdsi.github.io/CommonDataModel/
- OHDSI gender vocabulary page: https://www.ohdsi.org/web/wiki/doku.php?id=documentation:vocabulary:gender
- OHDSI visit occurrence table: https://ohdsi.github.io/CommonDataModel/cdm54.html#VISIT_OCCURRENCE
- OMOP questionnaire: https://www.surveymonkey.co.uk/r/NJY5TVT
- BHF Data Science Centre video: https://www.youtube.com/watch?v=WPv-AMA4S7c
- Data-quality terminology paper: https://pubmed.ncbi.nlm.nih.gov/27713905/
- OHDSI DataQualityDashboard: https://github.com/OHDSI/DataQualityDashboard
- DataQualityDashboard check descriptions: https://ohdsi.github.io/DataQualityDashboard/articles/CheckTypeDescriptions