Business Purpose of Archetypes

Archetypes are a way of adding domain semantics to information models while avoiding the endless growth and maintenance of the latter that would occur if a textbook approach to modelling were used.

To make the distinction between domain semantics and information models concrete, we can consider the domain of e-health as an example. Published information models such as those from openEHR, ISO 13606-1, HL7 CDA and others use a multi-level approach in which the main information model is separate from another layer of models that express domain semantics. The information models typically define such things as:

'clinical data types', e.g. Quantity (with units, accuracy etc), Coded text, Ordinal (an Integer/symbol conjunction);
various generic clinical data structures, such as 'clinical statement' (denoted by the type Entry in openEHR and ISO 13606-1), 'clinical document', 'report', and so on;
various infrastructure types to do with identification, versioning and so on.

Such an information model may contain 50-100 classes, including 30+ classes for the clinical data types, and enables the construction of instance structures corresponding to the various parts and sections of e.g. a clinical encounter note or a hospital discharge summary.
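The kinds of classes described above can be sketched in a few lines of Python. This is only an illustration: the class and field names (Quantity, CodedText, Ordinal, Element, Entry) are simplified assumptions and do not reproduce the actual class definitions of openEHR, ISO 13606-1 or HL7 CDA.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class Quantity:
    """A measured magnitude with units (accuracy omitted for brevity)."""
    magnitude: float
    units: str


@dataclass(frozen=True)
class CodedText:
    """A text value bound to a code in some terminology."""
    value: str
    terminology: str
    code: str


@dataclass(frozen=True)
class Ordinal:
    """An integer rank paired with a symbol."""
    rank: int
    symbol: CodedText


DataValue = Union[Quantity, CodedText, Ordinal]


@dataclass
class Element:
    """A named data point: the leaf of a name/value structure."""
    name: str
    value: DataValue


@dataclass
class Entry:
    """A generic 'clinical statement' that can hold any elements."""
    name: str
    items: List[Element] = field(default_factory=list)


# The generic model accepts a sensible statement...
bp = Entry("blood pressure", [
    Element("systolic", Quantity(120.0, "mm[Hg]")),
    Element("diastolic", Quantity(80.0, "mm[Hg]")),
])

# ...and equally a meaningless one, because these classes say nothing
# about which names, units or values belong together.
nonsense = Entry("blood pressure", [Element("systolic", Quantity(3.0, "km"))])
```

A handful of generic classes like these is enough to build instance structures for many kinds of clinical content, which is why the class count of such a model stays small.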

However, neither a class model of this size nor standard UML can naturally accommodate the number and diversity of instance values that can make up a clinical document created in each particular situation (e.g. a specific kind of patient visiting a specialist), the tens of thousands of clinical observations (e.g. 'systolic blood pressure', 'visual acuity', etc., many of them consisting of multiple data points in specific structures), or the O(1E4) laboratory test result analyte types. Further, the size of the terminology needed to annotate data items, both 'names' and 'values' in a name/value representation of the data, is in the range of O(1E5) - O(1E6) concepts, as exemplified by SNOMED CT, ICDx and dozens of other clinical terminologies.

The above situation applies across most information-rich industries, with varying but generally very large numbers.

Although technically these numerous possible values could be understood simply as the specific values that 'happen to occur' when data are created, it is widely recognised within mainstream IT that domain data value 'complexes' (co-occurring structures of data) correspond to meaningful patterns, and that these patterns constitute a small fraction of the astronomical number of possible combinations of values within structures. Thus, while some tens to hundreds of thousands of 'clinical statement' patterns would adequately cover nearly all of general medical data recording (leaving terminal real-world values such as an actual blood pressure open, within their respective sanity ranges), the information models in typical use would in fact permit possible instance structures numbering in the order of 1E10 and far higher; defining a class model with 1E20 possible instances is a relatively trivial matter. In other words, most of the instance data that can be constructed from a typical information model is garbage.

This is akin to the situation in natural language, where meaningful sentences constitute a tiny fraction of possible, grammatically correct word sequences.
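The scale of the mismatch can be illustrated with a small counting exercise. The sketch below assumes a toy generic structure in which each element pairs any name with any value type; all four figures are invented for illustration and are not taken from any real information model.

```python
def possible_structures(names: int, value_types: int, max_elements: int) -> int:
    """Count ordered element sequences of length 1..max_elements,
    where each element pairs any name with any value type."""
    choices_per_element = names * value_types
    return sum(choices_per_element ** k for k in range(1, max_elements + 1))


# Illustrative, assumed figures (not taken from any real model).
NAMES = 1_000          # distinct element names available
VALUE_TYPES = 5        # data types an element value may take
MAX_ELEMENTS = 5       # elements allowed in one structure
MEANINGFUL = 100_000   # assumed number of meaningful patterns

possible = possible_structures(NAMES, VALUE_TYPES, MAX_ELEMENTS)
fraction = MEANINGFUL / possible

print(f"possible structures: {possible:.2e}")
print(f"meaningful fraction: {fraction:.1e}")
```

Even with these modest assumptions the count of possible structures exceeds 1E18, so the assumed meaningful patterns make up a vanishingly small share of what the generic model permits.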

Figure 1. Basic model maths

In the figure above, the 'meaningful instance space' (shown in aqua) corresponds to the actual data of interest in any information system, and yet the only way to model it according to the classic textbook approach is simply by growing the information model. The problem with this is that the information model is generally the basis of the deployable software, and also the database schemas, and so constant changes imply constant system instability.

It is widely recognised that other mechanisms are needed to enable some sort of 'templating' of the 'meaningful instance space', so that common patterns can be defined. Doing so allows software and other artefacts (typically UI forms) to be developed on the basis of instance structures that are actually known to occur. The general need, which has been known for some decades, was described in Martin Fowler’s 1997 book 'Analysis Patterns' [1], in which 'patterns' are illustrated in 'above the line' parts of UML diagrams. It is generally understood that this kind of modelling cannot simply be an extension of the existing software or database schemata. The obvious IT reason is that doing so implies endless maintenance and updating of deployed software and, worse, frequent database migration. In systems operating 24x365 and routinely creating terabytes of data per year, this is not an acceptable approach.

However, there is an arguably more important reason to provide a generic means of modelling domain information patterns: the authors of domain level definitions or models will not be software or database developers, but domain experts of some kind, e.g. physicians or aeronautical engineers. The latter kind of professional will not generally know or care about the programming or modelling languages used by IT people, and will often have their own formalisms, incomprehensible to IT professionals. Further, the semantics they use in their domain will often not be directly representable in the comparatively simplistic formalisms of UML or ER models.

Consequently, some large software products in health and other domains include configuration or template-building tools that enable typical domain content patterns to be modelled, usually as screen form definitions. This partially addresses both problems: some domain semantics are now separate from the software, and they can be built by non-IT personnel using dedicated tools.

Figure 2. Screen templates: a useful stop-gap

There are of course many limitations to this kind of approach: it doesn’t model all of the 'meaningful instance space'; it is typically tied to user interface specifics; and it doesn’t stop the creation of nonsense data instances.

The main problem, however, common to all of mainstream IT, has been that no such modelling capability is available independent of the particular vendor product and its visual forms. Ideally it would be possible to do such modelling in a standard way for all domains, i.e. modelling domain patterns would be as generic as UML or ER modelling is. Unfortunately, even today the most advanced tools that may be technically powerful enough to do the job are buried inside specific products and are tied to the corresponding proprietary data models.

An important economic factor is that the creation of good quality domain models is time-consuming and expensive, relying as it does on domain experts (typically experienced clinicians, engineers etc.) rather than IT staff. If models are created inside a specific product (e.g. a particular hospital information system) and that product is replaced, there is often little appetite, and few staff available, to recreate in the new product environment the models and templates authored for the first product. Multiplied across products, sites, and whole industry verticals, the lack of standard ways of representing models of domain content has become a significant obstruction to the production of high quality information systems. Instead, as each solution is replaced, its domain models usually die with it.

The need for an efficient, formal, and product- and implementation format-independent domain modelling capability is therefore clear. In health, where the sheer amount of domain semantics requiring formalisation makes both the classic single-model approach and simplistic screen templating non-scalable, other methods have had to be developed. In openEHR, this took the form of the Archetype formalism, used in conjunction with terminologies including SNOMED CT, LOINC, ICDx and many others.

Two categories of domain content models can be distinguished, responding to a universal need to be able to represent both use-independent definitions of 'data points' and 'data groups', and use-case-dependent definitions of 'data sets'. Consider the case of recording patient vital signs. Assume that content models can be defined for 'blood pressure', 'heart rate' and 'blood oxygen'. These definitions need to be independent of specific uses such as patient home measurement, GP encounter, and hospital bedside measurement, since in all these cases each vital sign is recorded in exactly the same way. However, in each case these vital signs data points are recorded within a larger data set of items that corresponds to the health system event occurring, such as a GP patient health checkup or an ED initial assessment.

Thus there are two related requirements of the domain modelling formalism: first, to be able to model re-usable domain data items and structures, and second, to be able to model the larger use-case-specific combinations of these generic elements. The alternative would be to create a domain model for every data set and, within many of these models, to repeatedly define the same sub-model of recurring content, such as 'blood pressure'.
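The two requirements can be illustrated with a minimal Python sketch. It is not the Archetype formalism itself: the names DataPointDef, DataGroupDef, DataSetDef and problems are hypothetical, and the units and ranges are illustrative assumptions only. The point shown is that 'blood pressure' is defined once and then re-used by two different data sets.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class DataPointDef:
    """Use-independent definition of one data point."""
    name: str
    units: str
    low: float    # lower bound of the plausible ('sanity') range
    high: float   # upper bound of the plausible range

    def accepts(self, magnitude: float, units: str) -> bool:
        return units == self.units and self.low <= magnitude <= self.high


@dataclass(frozen=True)
class DataGroupDef:
    """Use-independent definition of a group of related data points."""
    name: str
    points: Tuple[DataPointDef, ...]


@dataclass(frozen=True)
class DataSetDef:
    """Use-case-specific data set that re-uses existing group definitions."""
    name: str
    groups: Tuple[DataGroupDef, ...]


# Recorded data: group name -> point name -> (magnitude, units).
Recorded = Dict[str, Dict[str, Tuple[float, str]]]


def problems(data_set: DataSetDef, recorded: Recorded) -> List[str]:
    """Return a message for every recorded value the definitions reject."""
    found = []
    groups = {g.name: g for g in data_set.groups}
    for group_name, values in recorded.items():
        group = groups.get(group_name)
        if group is None:
            found.append(f"{group_name}: not part of {data_set.name}")
            continue
        points = {p.name: p for p in group.points}
        for point_name, (magnitude, units) in values.items():
            point = points.get(point_name)
            if point is None or not point.accepts(magnitude, units):
                found.append(f"{group_name}/{point_name}: rejected")
    return found


# Defined once, independently of any use (ranges are illustrative only).
BLOOD_PRESSURE = DataGroupDef("blood pressure", (
    DataPointDef("systolic", "mm[Hg]", 0, 1000),
    DataPointDef("diastolic", "mm[Hg]", 0, 1000),
))
HEART_RATE = DataGroupDef("heart rate", (DataPointDef("rate", "/min", 0, 1000),))
BLOOD_OXYGEN = DataGroupDef("blood oxygen", (DataPointDef("SpO2", "%", 0, 100),))

# Two use-case-specific data sets re-using the same definitions.
GP_CHECKUP = DataSetDef("GP health checkup", (BLOOD_PRESSURE, HEART_RATE))
ED_ASSESSMENT = DataSetDef(
    "ED initial assessment", (BLOOD_PRESSURE, HEART_RATE, BLOOD_OXYGEN)
)

good = {"blood pressure": {"systolic": (120.0, "mm[Hg]"),
                           "diastolic": (80.0, "mm[Hg]")}}
bad = {"blood pressure": {"systolic": (3.0, "km")},
       "blood oxygen": {"SpO2": (97.0, "%")}}

print(problems(GP_CHECKUP, good))
print(problems(GP_CHECKUP, bad))
```

Because both data sets refer to the same BLOOD_PRESSURE object, a correction to that definition reaches every data set that uses it, which is the benefit of separating re-usable definitions from use-case-specific combinations.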

References

[1] M. Fowler, Analysis Patterns: Reusable Object Models. Addison-Wesley, 1997.