Business Purpose of Archetypes

Archetypes are a way of adding domain semantics to information models while avoiding the endless growth and maintenance of those models that would occur if a textbook approach to modelling were used.

To make the distinction between domain semantics and information models concrete, information models in openEHR, ISO 13606-1, HL7 CDA and more generally in e-health typically define 'clinical data types', such as Quantity (with units, accuracy etc); Coded text; Ordinal (an Integer/symbol conjunction); and various generic clinical data structures, such as 'clinical statement' (denoted by the type Entry in openEHR and ISO 13606-1), 'clinical document', 'report', and so on. Such an information model may contain 50-100 classes, including 30+ classes for the clinical data types, and enables the construction of instance structures corresponding to the various parts and sections of e.g. a clinical encounter note or a hospital discharge summary.

However, neither a class model of this size nor the capabilities of standard UML can naturally accommodate the explosion and diversity of possible instance values that can make up a clinical document created in any particular situation (e.g. a specific kind of patient visiting a specialist), nor the tens of thousands of clinical observations (e.g. 'systolic blood pressure', 'visual acuity' etc., many of them consisting of multiple data points in specific structures), nor the O(1E4) laboratory analyte result types. Further, the size of the terminology needed to annotate data items, both 'names' and 'values' in a name/value representation of the data, is in the range of O(1E5) - O(1E6) concepts, as exemplified by the SNOMED CT and ICDx terminologies.

The above situation applies across most information-rich industries, with varying but generally very large numbers.

Although technically these numerous possible values could just be understood as the specific values that 'happen to occur' in a situation of data creation, it is widely recognised within mainstream IT that domain data value 'complexes' (co-occurring structures of data) correspond to meaningful patterns that constitute a relatively small fraction of the astronomical number of possible combinations of values within structures. Thus, while some tens to hundreds of thousands of 'clinical statement' patterns would adequately cover nearly all of general medical data recording (i.e. leaving the terminal real world values such as actual blood pressure open, within their respective sanity ranges), the information models in typical use would in fact permit possible instance structures numbering in the order of 1E10 and much higher (defining a class model with 1E20 possible instances is a relatively trivial matter). In other words, most possible data constructions are garbage.
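The scale of this mismatch can be illustrated with a rough calculation. The following is only a sketch under assumed figures: the number of coded slots is hypothetical, and the terminology size is taken from the lower end of the range quoted earlier.

```python
# Illustrative arithmetic only; the slot count is an assumption, not a
# property of any particular information model.
TERMINOLOGY_SIZE = 10**5   # lower end of the O(1E5) - O(1E6) concept range
CODED_SLOTS = 4            # assumed number of freely codable items in one structure

# Every slot may carry any concept, so the combinations multiply.
possible_structures = TERMINOLOGY_SIZE ** CODED_SLOTS

# "Some tens to hundreds of thousands" of meaningful patterns; take 1E5.
meaningful_patterns = 10**5

fraction_meaningful = meaningful_patterns / possible_structures

print(f"possible structures: {possible_structures:.0e}")
print(f"meaningful fraction: {fraction_meaningful:.0e}")
```

Even with only four freely codable items, a generic structure admits about 1E20 combinations, of which a vanishingly small share would correspond to meaningful patterns.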

This is akin to the situation in natural language, where meaningful sentences constitute a tiny fraction of possible, grammatically correct word sequences.

It is also widely recognised that mechanisms are needed to enable some sort of domain level 'modelling' or 'templating', so that the common patterns can be defined and software or other mechanisms (e.g. pre-built UI forms) can subsequently be developed to limit the possible instance structures to those that actually make sense. The general need was identified in Martin Fowler’s 1997 book 'Analysis Patterns' [Fowler_1997], in which 'patterns' are illustrated in 'above the line' parts of UML diagrams, but has been known for some decades. It is generally understood that this kind of modelling cannot simply be an extension of the existing software or database schemata. The obvious 'IT reason' for this is that to do so implies endless maintenance and updating of deployed software, and worse, frequent database migration. In systems operating 24x365, and routinely creating terabytes of data per year, this is not an acceptable approach.

However, there is an arguably more important reason to provide a generic means of modelling domain information patterns: the authors of domain level definitions or models will not be software or database developers, but domain experts of some kind, e.g. physicians or aeronautics engineers. The latter kind of professional will not generally know or care about the programming or modelling languages used by IT people, and will often have their own formalisms, incomprehensible to IT professionals. Further, the semantics they use in their domain will often not be directly representable in the simplistic 'formalisms' of UML or ER models.

Consequently, most large software products in the health and other domains have some kind of configuration or template building tool(s) that enable modelling of typical domain content patterns, usually as screen form definitions. This partially addresses both problems: domain models are now separate from the software (but not the product), and they can be built by non-IT personnel, assuming a tool with a reasonable user interface.

The problem in mainstream IT has been that no such capability is available independent of particular software products (i.e. specific vendors), concrete visual forms (UI forms, XML Schemas etc) or domains (e.g. process and control systems engineering have domain specific languages) - i.e. even tools that may be technically powerful enough to do the job are buried inside specific products, and are usually targeted to the database schemas of the product.

An important economic factor is that the creation of good quality domain models is time-consuming and expensive, relying as it does on domain experts - typically experienced clinicians, engineers etc - rather than IT staff. If models are created inside a specific product (e.g. a particular hospital information system), and that product is replaced, there is often little appetite, or staff availability, to recreate in the new product environment the work done in authoring the models/templates for the first product. Multiplied across products, sites, and whole industry verticals, the lack of standard ways of representing models of domain content has become a significant obstruction to the production of high quality information systems. Instead, as each solution is replaced, its domain models usually die with it.

The need for an efficient, formal, and product- and implementation format-independent domain modelling capability is therefore clear, nowhere more so than in health, where the sheer amount of domain semantics requiring formalisation has necessitated the creation in openEHR of the Archetype formalism, used in conjunction with terminologies (e.g. SNOMED CT, LOINC, ICDx and many others).

Two categories of domain content models can be distinguished, responding to a universal need to be able to represent both use-independent definitions of 'data points' and 'data groups', and use-case dependent definitions of 'data sets'. Consider the case of recording patient vital signs. Assume that content models can be defined for 'blood pressure', 'heart rate' and 'blood oxygen'. These definitions need to be independent of specific uses such as patient home measurement, GP encounter, and hospital bedside measurement, since in all these cases, each vital sign is recorded in exactly the same way. However in each case, these vital signs data points are recorded within a larger data set of items that correspond to the health system event occurring, such as a GP patient health checkup or an ED initial assessment.

Thus there are two related needs: to be able to model re-usable domain data items and structures, and secondly, to be able to model the larger use case specific combinations of these generic elements. The alternative would be to create a domain model for every data set and within many of these models, to repeatedly define the same sub-model of recurring content, such as 'blood pressure'.
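The two levels of modelling can be sketched in code. The following Python fragment is purely illustrative and is not the Archetype formalism itself: the class names, the individual data points (other than systolic pressure) and the unit strings are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    """One named item within a reusable content model (hypothetical type)."""
    name: str
    units: str


@dataclass(frozen=True)
class ContentModel:
    """Use-independent definition of a data group, e.g. 'blood pressure'."""
    name: str
    items: tuple[DataPoint, ...]


@dataclass(frozen=True)
class DataSet:
    """Use-case-specific combination of reusable content models."""
    name: str
    content: tuple[ContentModel, ...]

    def item_names(self) -> list[str]:
        # Flatten to 'model/item' paths to show what the data set records.
        return [f"{m.name}/{p.name}" for m in self.content for p in m.items]


# Reusable definitions, authored once (items and units are assumed).
blood_pressure = ContentModel(
    "blood pressure",
    (DataPoint("systolic", "mm[Hg]"), DataPoint("diastolic", "mm[Hg]")),
)
heart_rate = ContentModel("heart rate", (DataPoint("rate", "/min"),))
blood_oxygen = ContentModel("blood oxygen", (DataPoint("SpO2", "%"),))

# Use-case-specific data sets that reference, rather than redefine, them.
gp_checkup = DataSet("GP patient health checkup", (blood_pressure, heart_rate))
ed_assessment = DataSet(
    "ED initial assessment", (blood_pressure, heart_rate, blood_oxygen)
)

print(ed_assessment.item_names())
```

Both data sets refer to the same 'blood pressure' definition, so a correction to that definition is made in one place rather than once per data set.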